BUREAU OF RESEARCH AND SERVICE
College of Education
University of Illinois
Urbana, Illinois


A CONSIDERATION OF INFORMATION THEORY AND UTILITY
THEORY AS TOOLS FOR PSYCHOMETRIC PROBLEMS


Lee J. Cronbach


Technical Report Number 1
under Contract N6ori-07146
with the Office of Naval Research


College of Education
University of Illinois
Urbana, Illinois


November, 1953




CONTENTS

Introduction

I. Measurement as a Communication Process
  Shannon's model for the communication system
  Tests viewed as communication systems
  Assignment of persons to categories as the aim of testing
  General meaning assigned to the term "test"
  Use of categories in reporting test results
  Summary of the test as a communication system
  Ways of using communication theory
  Implications of the communication analogy

II. Measures of Uncertainty and Information in Terms of Message Length
  A derivation of Shannon's measure of uncertainty
  Degree of confidence
  The standard item
  Standard items required to reach certainty
  Interpretation as sequential analysis
  Interpretation in terms of coding
  Interpretation as log confidence
  Cautions in employing H to measure uncertainty
  Information as the reduction of uncertainty
  Measure of residual uncertainty after testing
  The confidence continuum and the measure R
  Exhaustiveness and dependability
  The index of exhaustiveness
  The index of dependability
  Significance of the formulas
  Non-symmetric validity relations
  Relations involving a fallible criterion
  Application of the formulas to ordered scales
  The rectangular distribution
  The normal distribution
  R under assumptions of normality
  Summary of major implications

III. Measures of Uncertainty and Information in Terms of Correct Decisions
  A measure of average uncertainty
  Previously published formulas for measuring discriminating power
  Geometric interpretation as a dispersion measure
  Residual uncertainty after testing
  Gain in average confidence (information)
  Information yielded by a series of independent items
  Exhaustiveness and dependability
  Significance of confidence formulas

IV. An Introductory Statement of Utility Theory, Giving its Implications regarding the Information Formulas
  [entry illegible]
  Basic data required
  Transition matrix
  [illegible] matrix
  Interpretation matrix
  Calculation of utility
  Review of the average confidence formulas
  Review of the average log confidence formulas

V. Analysis of a Psychiatric Screening Test by Utility Theory
  Correlational validity
  Utility analysis
  Introduction of evaluations
  Determining entries for the evaluation matrix
  Likelihood ratios as a basis for cutting scores
  Choice of cutting score
  Optimal strategy at high and low E.R.
  Comparison with conclusions of correlational analysis
  Utility curves for various strategies
  Decisions with the test compared to a priori decisions
  Strategies to be compared
  Results
  Limited strategies
  Reduction of costs

Conclusions

References


INTRODUCTION


In any field of investigation, the adoption of a new mathematical
model often permits one to investigate questions which had been overlooked
or had been incapable of treatment under the model formerly employed.
Many investigators have found the communication model introduced by Shannon
(26) a stimulating conceptualization and point of departure.


It has seemed to several writers that this approach would be particularly
useful for test theory. Both Hick (19) and Miller (22) have suggested
this, but question whether communication theory will yield fundamentally
new results, or will contribute chiefly by presenting old results
in a fresh perspective. Either of these contributions might be significant.


The present writer began to examine testing problems in terms of
Shannon's information theory in 1949. It became clear that this new model
had several values. A particularly striking feature is the generality
which permits one to formulate statements about tests which apply to
devices which categorize, devices which measure along a single numerical
scale, devices which yield scores along several dimensions, and even those
instruments which lead to verbal descriptions of the person tested. This
general approach suggests that we inquire how much useful information a
testing procedure tells us. The older approaches to test evaluation ask
how accurate the test is in making any single measurement and prediction,
and make no provision for evaluating as a whole the information yield of a
test which provides many measures or predictions.


A preliminary report in 1952 (11) presented a statement of testing
problems in the language of information theory. In this report, no attempt
was made to reexamine the theory developed by Shannon; rather his
formulas were taken over bodily and translated into implications. Criticisms
of the preliminary report were solicited, and these drew attention
to the fact that the model itself required careful reconsideration.
While the informational approach led to new insight regarding tests, there
was reason to think that alternative mathematical treatments involving
much the same conceptions deserved consideration, and might be even
more appropriate. Under support from the Office of Naval Research, we
undertook to study the basis of Shannon's formulas, to examine other possible
formulations of the testing problem, and to identify implications
for test analysis.


The comparison of various approaches has made clear that while the
Shannon measure has interesting properties, it does not correspond perfectly
to the requirements of psychometrics. Assumptions are embodied in the
Shannon model which are not appropriate for test analysis. At the same
time, we have identified even more places than before where the examination
of a test in terms of the communication model suggests concepts or
questions which are worthy of serious thought.




An alternative approach which we shall discuss in this report is an
information analysis based on average probabilities in contrast to Shannon's
use of an average of log probabilities. This analysis includes and extends
the formulas for discriminating power of a test which have been proposed
several times in recent years. This series of formulas has approximately
the same implications as those based on the Shannon measure, even though
the mathematical formulas differ. We find this second method of analysis
superior in some respects to that of Shannon, but unfortunately it too
has serious limitations for test analysis.
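
The contrast between the two kinds of average can be made concrete with a small sketch. It assumes, as one plausible reading that this passage does not spell out, that an "average probability" is the expected probability of the category that actually occurs (the sum of squared probabilities), while the Shannon measure averages the log probabilities. The function names and the three example distributions are illustrative only.

```python
import math

def average_log_probability(probs):
    """Average of log2 probabilities, each weighted by its own probability.

    The negative of this quantity is Shannon's uncertainty H, in bits.
    """
    return sum(p * math.log2(p) for p in probs if p > 0)

def average_probability(probs):
    """Expected probability of the category that occurs: sum of p squared.

    Assumption: this is one reading of "average probabilities"; the
    report's own formulas are those of its Section III.
    """
    return sum(p * p for p in probs)

examples = (
    [0.25, 0.25, 0.25, 0.25],  # four equally likely categories
    [0.7, 0.1, 0.1, 0.1],      # one category favored
    [1.0, 0.0, 0.0, 0.0],      # outcome certain
)
for dist in examples:
    h = -average_log_probability(dist)
    c = average_probability(dist)
    print(dist, "H =", round(h, 3), "average probability =", round(c, 3))
```

On these three distributions the two measures rank the cases the same way: the uniform distribution is the most uncertain and the certain outcome the least, although the numerical scales differ.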

The first product of our work, then, is that we have found out what not
to use and why not. It is valuable to disclose the limitations of the
Shannon formulas or the discrimination formulas for psychometrics, since the
findings warn investigators in this field against using these schemes of
analysis indiscriminately.


Our work has also indicated what will be a more suitable line of attack.
The problem of the tester is to employ whatever testing time is available in
whatever way will most improve the decisions made. The value of the tests
is to be judged by the improvement in decisions. Therefore a "utility theory"
is called for. The two systems of information analysis we have considered
are in essence special cases of utility analysis which invoke somewhat limiting
assumptions. We expect the utility theory to provide a proper formal
demonstration of the conclusions suggested by the information analogy, and
the utility approach, being more general, should also lead to conclusions
not covered by the information treatments. Moreover, the explicit utility
theory should have substantial value in organizing and clarifying test theory.
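
The idea of judging a test by the improvement in decisions can be sketched in a few lines. This is only an illustration under stated assumptions, not the formulation developed later in the report: the joint distribution, the payoff table, and all function names are hypothetical, and the user is assumed to pick the decision with the highest expected payoff.

```python
def expected_utility(joint, payoff, decision_for_score):
    """Expected payoff when the decision depends only on the test score.

    joint[c][s]           : P(true category c and obtained score s)
    payoff[d][c]          : value of taking decision d when the truth is c
    decision_for_score[s] : decision taken when score s is observed
    """
    return sum(joint[c][s] * payoff[decision_for_score[s]][c]
               for c in range(len(joint)) for s in range(len(joint[0])))

def best_strategy_value(joint, payoff):
    """Expected payoff when each score gets its highest-payoff decision."""
    total = 0.0
    for s in range(len(joint[0])):
        total += max(sum(joint[c][s] * payoff[d][c] for c in range(len(joint)))
                     for d in range(len(payoff)))
    return total

def a_priori_value(joint, payoff):
    """Expected payoff of the best single decision made without the test."""
    prior = [sum(row) for row in joint]
    return max(sum(prior[c] * payoff[d][c] for c in range(len(prior)))
               for d in range(len(payoff)))

# Hypothetical numbers: two true categories, two score levels, two decisions.
joint = [[0.40, 0.10],   # truly unsuitable: P(low score), P(high score)
         [0.10, 0.40]]   # truly suitable
payoff = [[0.0, 0.0],    # decision 0 = reject
          [-1.0, 1.0]]   # decision 1 = accept
gain = best_strategy_value(joint, payoff) - a_priori_value(joint, payoff)
print("gain in expected utility from using the test:", round(gain, 3))
```

In this sketch the worth of the test is the difference between the two expected payoffs; a test whose scores never change the best decision would show a gain of zero.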


At this point, we have not studied the utility approach thoroughly, and
are not ready to present it in final form. In order to introduce it, we provide
a brief sketch of the essential concepts. Ultimately, we expect to replace
the present report with a discussion organized around utility theory.


While exploring utility theory, we worked through various examples. One
of these worked examples is presented because its conclusions are interesting
in themselves. The presentation provides a simple illustration of
utility analysis and the types of result that can be expected from it.


The sections which follow are a record, therefore, of the thinking
developed during the past year. We may comment briefly on the nature and
possible significance of each.


Section I describes the communication model and shows that the test may
be viewed as a communication system. This is a purely verbal presentation.
It sets forth a series of analogies which are useful in thinking intuitively
about tests, regardless of the mathematical system adopted. It provides the
basis for applying mathematical information models. This section is essentially
a restatement of materials given in the 1952 report.




Section II presents the formulas for analyzing information according to
Shannon's assumptions, and clarifies the meaning of those formulas. While
we do not recommend that tests be treated by these formulas, many of them
suggest important ways of looking at tests. We expect ultimately to develop
comparable but more adequate formulas within the utility model. Of particular
importance in Section II are the examination of testing as a problem in
sequential analysis (following Evans (14)) and the consideration of the
concepts of exhaustiveness and dependability.


Section III presents the formulas for analyzing information by arithmetic
averages. While these formulas are also to be replaced by a more
adequate system eventually, they are somewhat closer in conception to utility
theory than Shannon's. Persons interested in previously proposed evaluations
of tests in terms of their "discriminating power" will find that Section III
clarifies the significance of those formulas.


Section IV discusses briefly the essential concepts of utility theory,
together with the reasons for preferring this schema to the formulations
presented in II and III. This section indicates the line of attack we intend
to follow in further studies, and provides a foundation for understanding
Section V.


Section V is a study by means of utility theory of one particular test,
intended for psychiatric screening of recruits. This provides a demonstration
of the type of conclusions utility theory can yield, which would be
overlooked in conventional validity analysis. The procedures of this section
may be applied to other screening tests.


These reports are primarily designed to share our present thinking with
others interested in these problems, rather than to provide a final statement
of conclusions. Criticisms of the concepts developed in this report
will be welcomed.


Our present thinking has depended to a very large extent upon the stimulation
provided by comments on the 1952 report from the following persons:
Yehoshua Bar-Hillel, Massachusetts Institute of Technology; Raymond H. Burros,
Training Research Laboratory, University of Illinois; A. S. C. Ehrenberg,
[illegible] of London; George A. Ferguson, McGill University; W. R. Garner,
Institute for Cooperative Research, Johns Hopkins University; W. E. Hick,
Applied Psychology Research Unit, Psychological Laboratory, Cambridge,
Massachusetts; F. M. Lord, Educational Testing Service; William McGill,
Massachusetts Institute of Technology; Brockway McMillan, Bell Telephone
Laboratories; George Miller, Massachusetts Institute of Technology;
Frederick Mosteller, Department of Social Relations, Harvard University;
Henry Quastler, Control Systems Laboratory, University of Illinois;
D. [illegible], College of Education, University of Illinois.


Particular credit for the work presented here should go to Dr. Eugene
Burdock, now with Carnegie Corporation. He served as Research Associate with
this project in 1952-1953, and rendered valuable assistance. He was author
or co-author of many file memoranda which represented way stations toward
the present paper. Dr. Goldine Gleser of Washington University Medical
School, Department of Neuropsychiatry, has helped substantially in the
preparation of the present report.




I. MEASUREMENT AS A COMMUNICATION PROCESS


Tests have customarily been described in language derived from the measurement
problems of the physical sciences. The very use of the term measurement
suggests that the function of a psychological or educational test is
analogous to that of the meter stick. Employing physical measurement as a
model, the psychologist has gone on to employ such associated concepts as
scale, error of measurement, and reliability as principal constructs for the
evaluation of tests. Recently there has been increased awareness that many
instruments for measurement in the behavioral sciences lack the properties
of physical measuring scales. Coombs, among others, has suggested that tests
can be more soundly interpreted if other models are used as a basis for
measurement theory (7).


An alternative model also offered by the physical sciences suggests new
ways of looking at tests. The communication engineer, dealing with problems
in electronics particularly, has made extensive use of a schematic model of
the communication system and of a mathematical theory of communication. It
is possible to describe tests also as communication systems. The communication
analogy throws new light on the function of tests, and leads to new
ways of evaluating the power of any test.

Shannon's Model for the Communication System

Although an excellent summary of communication theory is already provided
in a recent article by George Miller (22), a brief account of the communication
model is offered here.


The mathematical theory of communication is most closely associated with
Claude Shannon. He developed the system while analyzing problems of coding
in cryptanalysis, and later applied it to problems of transmission in electrical
and electronic communication systems (25, 26). It has been employed
independently by Wiener in the study of servo-mechanisms, and has
had a substantial number of applications to all types of dynamic phenomena.
Indeed, its protean adaptability is both the greatest advantage and greatest
disadvantage of the model. The information concept can be used to describe
almost any process, but has so many different possible interpretations that
it is hard to focus on the exact meaning of such a concept as "amount of
information transmitted." Quastler, who has edited a symposium on applications
of the theory, writes:

    "Information Theory" is a name remarkably apt to be misunderstood.
    The theory deals, in a quantitative way, with something called
    "information" which, however, has nothing to do with meaning. On the
    other hand, the "information" of the theory is related to such diverse
    activities as arranging, constraining, designing, determining,
    differentiating, messaging, ordering, organizing, planning, restricting,
    selecting, specializing, specifying, and systematizing; it can
    be used in connection with all operations which aim at decreasing such
    quantities as disorder, entropy, generality, ignorance, indistinctness,
    noise, randomness, uncertainty, variability, and at increasing
    the amount or degree of certainty, design, differentiation,
    distinctiveness, individualization, information, lawfulness, orderliness,
    particularity, regularity, specificity, uniqueness. All these
    quantities refer to some difference between general and specific;
    in this sense, they can be measured with a common yardstick.
    Furthermore, measures which are appropriate exist, due to the
    developments of Information Theory.

A communication system involves a transmitter, a receiver, and a channel
connecting them (Figure 1). The communication is sent because the receiver
is uncertain, and an appropriate communication should reduce his uncertainty.
If we think of the teletype, as used to transmit stock quotations, we can see
the essential features of the system.

(1) Ensemble of possible messages. The transmitter can choose among a
great variety of possible messages. Some of these are more likely to occur
than others, but until the message is selected, there is some uncertainty
as to what it will actually be. The receiver can expect that today's quotation
on U. S. Steel will be within a few points of yesterday's, but he
cannot be certain of this, nor can he predict the precise quotation that will
be transmitted.

(2) True message, or input. When the market closes, the true quotation
(one of the possible messages) is available at the transmitter but the receiver
is uncertain what it is. This message put into the system is called an
input. For any input, some output is received at the other end of the system.

(3) Channel. The channel includes all the equipment used in conveying
the message. Actually, the channel can be divided into a chain of smaller
communication systems (operator's visual system, operator's motor system,
teletypewriter, cable, etc.), for each segment of the channel has its own
input and output. But we lose nothing by regarding a sequence of these
minuscule systems as a single communication channel.

The channel has its own response properties. For any given input, there
is some distribution of outputs which the channel may yield. The channel
filters out some parts of the input and fails to transmit them; it may
introduce errors of various sorts due to mechanical failures, friction, and
interference.


(4) Channel capacity. Each communication channel has a limit or capacity.
Only some specified number of transmissions per unit of time can take
place, either because of an agreed rate of transmission or because the
physical properties of the system prevent more rapid transmission.


(5) Noise. When a particular message is sent several times, there may
be variation in the output. Thus, when the letter L is typed at the transmitter,
interference (static, or cross-talk between circuits) may cause K
sometimes to be printed at the receiver. Such variation introduced in the
channel is called noise.

Figure 1. The Communication Model

(6) Receiver. The receiver may be regarded as a person who wishes to
make some decisions (i.e., accept or reject some hypotheses) and who desires
to reduce his uncertainty in order to make better decisions. Information
regarding today's stock quotation is needed as a basis for tomorrow's
decisions.


Before the message is transmitted, we consider the receiver as having
some initial or a priori uncertainty. After the transmission, the receiver
has a message containing some degree of error, great or negligible. There is
no possible harm in such a misprint as "U S STBBK 47½" which can be instantly
corrected. It would be very important that the numerals be transmitted without
error. So long as error is possible, the receiver has some a posteriori
uncertainty after receiving the message.


Tests Viewed as Communication Systems


The person who gives a psychological or educational test is in the position
of a receiver who wishes to become more certain before making decisions.
The employer judging an applicant, the clinician diagnosing a patient, the
teacher evaluating a pupil's proficiency: each assumes the existence of some
true classification or description for the subject which he wishes to
determine as certainly as he can. The testing procedure is a device for
making a more certain decision as to the probable true message or description
of the individual. While, with the teletype, the true message is actually
known to someone when the transmission is sent, the true message desired in
testing is ordinarily unknown and not directly observable.


Consider the employer who has ten vacancies to fill, and fifty applicants.
A priori, without further information, he could only select men by
chance. He is uncertain whom he should employ. To gain information he gives
each man a ten-day tryout. This is his information-getting device. At the
end of this time, he may have observed sufficient differences to warrant a
definite decision as to which ten men to hire. Or he might be certain about
eight men, and have another group of twelve who seem equally qualified for
the two remaining vacancies. The tryout has decreased his uncertainty. A test
might have been used in place of the tryout, and it would similarly have
reduced his uncertainty to some degree.
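
The employer's reduction in uncertainty can be put in numbers with a short sketch. Two assumptions are made that the passage itself does not state: every admissible set of hires is treated as equally likely, and uncertainty is measured as the base-2 logarithm of the number of such sets (Shannon's measure for equally likely alternatives). The function name is illustrative.

```python
import math

def selection_uncertainty(n_candidates, n_vacancies):
    """Bits needed to single out one hiring choice when all are equally likely."""
    return math.log2(math.comb(n_candidates, n_vacancies))

before = selection_uncertainty(50, 10)  # any 10 of the 50 applicants
after = selection_uncertainty(12, 2)    # 8 hires settled; 2 of 12 still open

print(f"before tryout: {before:.2f} bits")
print(f"after tryout:  {after:.2f} bits")
print(f"reduction:     {before - after:.2f} bits")
```

Under these assumptions the tryout takes the employer from roughly 33 bits of uncertainty to roughly 6 bits. If the tryout had settled all ten places, the second figure would be zero.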


All elements of the communication system are to be observed in testing:


(1) Ensemble of possible messages = Criterion distribution. The "possible
messages" are the true descriptions or scores which might arise in the
population being studied; the receiver wants to know which of these applies
to the person under study.




(2) Input = Criterion classification. Each person tested has a true or
criterion classification.


(3) Channel. The entire operation of testing, which may include observation,
scoring, and interpretation of performance, may be regarded as part
of the communication system. Each of these is an opportunity for information
to be lost (discarded or distorted), and it may sometimes be profitable to
study one specific link in the chain separately.

(4) Channel capacity. While the engineer ordinarily thinks of capacity
in terms of information transmitted per unit time, the tester may often be
more interested in the yield per test administered. Each test has
a certain capacity, determined by its discriminating power.


(5) Noise = Error in testing. Typical errors which cause a single input
(true classification) to yield different outputs (obtained classification) are
sampling errors, fluctuations in the subject over time, chance responses due
to guessing, and error in scoring.


(6) Receiver = Test user. The receiver arrives at some final obtained
score or obtained classification on which a decision is based.
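
The correspondence listed above can be illustrated numerically. The following is a minimal sketch rather than the report's own computation: it assumes the standard Shannon measure of uncertainty in bits, and the prior distribution and the tables of P(obtained classification | true classification) are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon uncertainty in bits for a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def transmitted_information(prior, channel):
    """Return (a priori uncertainty, a posteriori uncertainty, information).

    prior[i]      : P(true category i), the criterion distribution
    channel[i][j] : P(obtained category j | true category i), i.e. the
                    test with its errors of measurement (noise)
    """
    n_out = len(channel[0])
    # Joint distribution of (true, obtained) classifications.
    joint = [[prior[i] * channel[i][j] for j in range(n_out)]
             for i in range(len(prior))]
    # Marginal distribution of obtained classifications.
    p_out = [sum(row[j] for row in joint) for j in range(n_out)]
    h_prior = entropy(prior)
    # Average uncertainty about the true category once the result is known.
    h_post = 0.0
    for j, pj in enumerate(p_out):
        if pj > 0:
            h_post += pj * entropy([row[j] / pj for row in joint])
    return h_prior, h_post, h_prior - h_post

prior = [0.5, 0.5]                      # two equally common true categories
error_free = [[1.0, 0.0], [0.0, 1.0]]   # test never misclassifies
fallible = [[0.9, 0.1], [0.2, 0.8]]     # test sometimes misclassifies
useless = [[0.5, 0.5], [0.5, 0.5]]      # result unrelated to the truth

for name, channel in (("error-free", error_free),
                      ("fallible", fallible),
                      ("useless", useless)):
    h0, h1, info = transmitted_information(prior, channel)
    print(f"{name}: a priori {h0:.3f}, a posteriori {h1:.3f}, information {info:.3f}")
```

The error-free test removes all of the a priori uncertainty, the useless test removes none, and the fallible test falls between the two.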


Assignment of Persons to Categories as the Aim of Testing


The preceding section has been quite general, making no assumptions regarding
the form of the messages being transmitted. These might be descriptions
such as are derived from a projective test protocol, categorical designations
such as "literate" - "illiterate", or numerical scores such as an
IQ. The mathematical theory of communication, however, is stated in terms
of qualitative unordered categories. To examine tests in the light of Shannon's
mathematical formulation, we shall think of our criterion "messages" as
categorical classifications. It can be readily shown that this formulation can
be extended to continuous (numerical) scales and to complex descriptions.


General Meaning Assigned to the Term "Test"


The usage of such terms as "test" and "measurement" varies among different
writers. Some would restrict the term "test" to devices yielding quantitative
statements about individual differences. Others would restrict the
word "measurement" to processes which locate individuals on a scale having
demonstrably equal intervals, or might even require that the scale have an
absolute zero. Since our interest is to develop a highly general theory,
applicable to the widest variety of studies of individual differences among
persons, we will not apply such restrictions.


We shall speak of any procedure employed to compare two or more persons
as a "test". This means that our discussion will be relevant to many devices
which are not essentially quantitative, and which would not be called
"measurements" in the ordinary senses of that term. An interview, for example, or
a clinical counseling session, permits an employer or a counselor to make
judgments about the individual he talks with. This judgment may be expressed
in a complex verbal statement which is formally quite unlike the numerical
result a test would yield. The purposes of interview and test may be much
the same, however; namely, to permit a more confident decision about one or
more questions. It is therefore reasonable to inquire which of the two
procedures is preferable, in any given problem. It may be necessary to develop
our method of analysis somewhat differently for various forms of test report,
but this does not make less valuable the overall conception within which these
detailed models fit.


It is to be noted that a test is defined as a process for making comparisons.
Perhaps we could equally well regard a test as a process for describing
any one individual. The statement that "John is male" is significant,
however, only because it is possible for John to be non-male. There is no
use, from the point of view of making decisions about John, in reporting a
truism. The purpose of any test is to find out how one individual differs
from others; thus all testing is an attempt to make discriminations between
individuals. Sometimes, it is true, our result is expressed in a statement
about John alone. But the statement "John is six feet tall" describes John
on a scale which has meaning only because other persons or objects have
different heights. The user of a quantitative statement brings to bear a frame
of reference for interpreting the scale, and this frame of reference arises
out of experience with other individuals. Hence any result from a test is
an implicit comparison or discrimination.

Use  of  Categories  in  Reporting  Test  .Results 

A test  is  u.sed  to  aid  in  raking  decis.-'‘ons  or  statements  of  the  form  "S 
belongs  in  category  1'.  ror  the  psychologist,  3 will  ordinarily  be  a person 
or  animal,  but  the  seme  .model  holds  ”'ner.  a sociologist  reports  a character- 
istic of  a noighoci’hood,  or  when  an  educator,  let  us  say,  judges  a school 
Duilding.  In  oehaviorai  testing,  typical  statements  naming  a categor';  ex- 
plicitly are  the  following: 


"S  is  '-aranoid  schizophrenic" 

"3  is  a poor  risk  for  medical  school" 

"S  is  a superior  officer" 
has  a speecn  de.fect" 

"o  is  in  a state  of  high  ai'iXiu by" 

"5"  needs  further  work  on  the  addition  combinations" 

That these are categorical statements is instantly seen by noting that the
word "not" or "no" inserted in the statement causes it to describe a different
category of persons from whom S is being differentiated. Sometimes the
category named is one of a whole set of possible categories (e.g., the various
categories used to classify patients); sometimes there are only two categories
in a set (good risk, poor risk, for example).




Categories are likewise used when a result is expressed in numerical terms,
even though we think of numerical scales as continuous rather than discrete.
"John is six years old" places John in a category of "persons having CA
between 5-6 and 6-6", under the usual definition of this class interval. If
the measurement is more precise, the category becomes smaller. "John's
four-year grade average was 3.1523" locates John in a category "3.15225 -
3.15235".
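
The class interval implied by a rounded report can be worked out mechanically. The following is a minimal sketch in Python, not part of the original report; the function name `class_interval` is introduced here for illustration only, and it assumes the reported figure was rounded to its last printed digit.

```python
from decimal import Decimal

def class_interval(reported: str):
    """Return the (lower, upper) limits of the category implied by a value
    reported to a given number of decimal places.

    Assumption: the report was rounded to the last digit shown.
    """
    value = Decimal(reported)
    # Half of one unit in the last reported place, e.g. 0.00005 for "3.1523".
    half_unit = Decimal(1).scaleb(value.as_tuple().exponent) / 2
    return value - half_unit, value + half_unit

lower, upper = class_interval("3.1523")
print(lower, upper)
```

Exact decimal arithmetic is used so that the interval limits are not disturbed by binary rounding.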


When we have a set of categories, it may be important to recognize whether
the categories are unordered, ordered, or partially ordered; whether the
category system is unidimensional (e.g., red, black) or multidimensional
(red-odd, red-even, ...); and whether, if ordered, any statement is to be made
about the magnitude of the scale interval represented in the category (8).
These differences may influence the way in which the communication or utility
model is to be applied. Shannon states the problem, however, in terms of a
system of k categories, without restricting the category system to any of
the above types.


We have stated that the tester desires to make decisions about an individual.
In order to do so as wisely as possible, he would like to know what
categories this individual falls into. The categories that concern the tester
will of course be those relevant to his problem. The tester has some a priori
expectation about the subject's classification. If this a priori knowledge
is too uncertain to be a basis for action, the tester seeks to reduce
uncertainty by means of the testing device. If the test is useful, it permits
the tester to assign the person to a category and to make decisions about him
with greater confidence than he had before testing. It is consistent both
with common usage and with communication theory to say that the test has
contributed "information" about the individual. The person's proper
classification is an unknown "message" to be discovered. After administering
the test, we have an "obtained message" which may be an accurate basis for
classifying him, or which may be to some degree untrustworthy.

The essential problems of test theory are (1) to find appropriate measures
for the increase in confidence or goodness of decisions permitted by a
test, and (2) to determine how tests can be designed to be most effective.


Ways  of  Using  Communication  Theory 

Communication theory (information theory) can be used in two ways. First,
it offers an analogy which we can use to describe tests and testing problems.
Second, it offers a mathematical system, containing definite postulates,
formulas, and theorems. These two uses make different demands upon the
theory, and it is well to distinguish them.


Analogies in thinking are helpful, indeed indispensable. The Lewinian
school of thought, for instance, communicates effectively with such terms as
"force", "barrier", and "distance". This permits a reader to visualize a
conclusion which would otherwise be unduly abstract. The concept of
homeostasis or dynamic equilibrium, borrowed from chemistry and physiology, is
similarly used to emphasize certain aspects of adjustive action.


Analogy is primarily a contributor to creative thinking: to perceiving
problems in a new light, or to imagining new solutions. Analogy is much less
useful for criticizing hypotheses or for establishing precise results.
Analogies are rarely perfect, and any failure of the event under study to
correspond to the model means some degree of error in the conclusions.


When a model is used as a basis for rigorous deduction or for mathematical
treatment of a problem, careful scrutiny of its details is necessary. Either
we must be certain that the model fits the problem in all details, or we must
locate the precise points of non-correspondence and ascertain just how much
effect these "errors" have upon the result we are deriving.


Information theory has been used in both these ways, i.e., formally and
informally. In the informal use, the investigator tries to state his problem
in informational terms. This may clarify his problem and suggest variables
whose effect he should measure in his experiment. A formal use of the theory
is represented when the investigator adopts the Shannon formulas and computes
his results in terms of them. Thus Hick ( 1?) suggests, as an example, that
the optimum item difficulty may be determined by maximizing the rate of
transmission of information, which is, under certain conditions, a function of
the item difficulty. Since this function is rather flat, any failure of the
formulas to fit the case under study may lead to a sizeable error in
determining the optimum difficulty. Before we can accept conclusions from any
such formal use of the Shannon formula, we need to know whether it does fit
the testing problem perfectly, or how serious would be the effects of its
failure to fit.


In the ensuing sections we present the Shannon formulas in such a way
that some of their characteristics are apparent. We also develop a set of
reasonable alternative formulas. The formulas may be applied formally only
in situations which fit the underlying model in all details; this is rarely
the case in problems of test analysis.


It is not necessary to be so cautious in using the communication model
informally. Used simply as an analogy, it is remarkably fruitful in
suggesting ideas. We shall point to such inferences as we go along.


Implications  of  the  Communication  Analogy 


Two simple but important alterations in concepts regarding tests result
from the general model as presented in this chapter.

(1) The value of a test should be judged by its ability to reduce the
a priori uncertainty. (Cf. 10, p. 65). This seems obvious if we view a test
as a communication.




This is not, however, the concept most often employed in judging tests.
The test has almost always been judged by a correlation (or the like) stating
the validity of the test. A correlation is a measure of improvement over
chance decisions. If the potential user of the test is already able to make
inferences with better than chance accuracy, using whatever information about
the individual he has prior to testing, the validity coefficient
overestimates the possible contribution of the test. An example shows the
significance of this distinction.

Without giving a mental test, the school can probably make fairly good
estimates of pupils' mental ages. From age, grade in school, and the
teacher's opinion, the score on a group test could probably be predicted
a priori, with only moderate error in most cases. The contribution of the
test should be judged, not by ability to report scholastic ability to a
receiver who has no knowledge, but by its ability to give information not
already at hand. The mental test does give some essential new information. A
conventional coefficient does not report the validity over and beyond
a priori knowledge; it estimates validity over all pupils in a broad class,
crediting the test for information overlapping what the teacher knows before
the test is given. In contrast, the TAT and the like are regarded as
inaccurate by usual standards. But since the school probably is completely
ignorant about pupils' fantasies and the needs they signify, whatever the TAT
validly reports is new information. Hence test analysis which discounts
a priori information will tend to encourage the use of tests which measure
things not now known, and to discourage use of tests, however accurate, which
duplicate information that can be gleaned from records already at hand.


In quite a different context, we may point to the many current studies
of the effectiveness of the clinician in making diagnoses or descriptions.
A common plan is to determine how many correct judgments the clinician can
make upon the basis of some set of data, compared to the number he might
make by chance. One might provide the judge with data about the individual
and ask him to predict certain aspects of the individual's behavior. Instead
of comparing his success with chance, we might compare it with the judge's
success in predicting solely from knowledge that the individual belongs to a
certain group (e.g., patients). There is evidence that the clinician
sometimes does better when making the latter type of prediction, from very
little information, than when he predicts from more complete data (17).
That is to say, the individualized procedure, whether better than chance or
not, is poorer than the prediction based on a stereotype of S's group. The
evaluation of the clinical procedure should be based on improvement over
the best a priori procedure, rather than on improvement over chance. Dailey
has shown one method of estimating the goodness of a clinical procedure by
comparing it with the best a priori estimate (13).

(2) Multidimensional measures can be compared with unidimensional
measures. In previous psychometric analysis, it has been customary to judge
a test against a single criterion. A multidimensional test may be treated
by multiple correlation, to determine if it predicts this criterion better
than does some unidimensional measure. The question is rarely raised,
however, whether the multidimensional test which can be interpreted so as to
predict many criteria gives more information than some unidimensional test
aimed toward just one criterion. Some mathematical methods, such as canonical
correlation, are available for this problem, but they have been given little
application in psychometrics.

In communication analysis, the question just raised is a very reasonable
one. Given a certain amount of transmitter time for our message, we might
either seek a highly dependable report on some one thing, or we might
prefer to have less complete or less accurate reports on many things. To
take a simple example: Late on Saturday afternoon, one radio listener might
prefer a fifteen-minute sports report in which dozens of football scores
were read off: very little information about each of many separate questions
or uncertainties. Another listener would much prefer a fifteen-minute
report describing a particular game play-by-play: much information on one
closely related set of questions; no information on other questions.

Sometimes noise or error in transmission makes it very important to
send just one single message with maximum precision (as when transmitting an
S-O-S, together with the ship's position). The operator sends his limited
message over and over, even though he could use the same time to convey
additional (but less important) information. These examples make it clear
that it is meaningful to ask whether, in a given situation, we are wiser to
emphasize breadth of coverage at the risk of thoroughness, or vice versa.

One of the major problems posed for test evaluation is to find a proper
formula for comparing the procedure that gives moderately dependable
predictions on many dimensions with the procedure which is more dependable but
less comprehensive. It may well be that the interview, capable of touching
on dozens of dimensions in a half-hour, is frequently a better personnel
classification procedure than the test which answers just one question with
great precision. While our use of information theory poses this question, to
answer it accurately will require that we develop a rigorous treatment using
utility theory.




II.  MEASURES OF UNCERTAINTY AND INFORMATION IN TERMS
OF MESSAGE LENGTH


A Derivation of Shannon's Measure of Uncertainty


Following the general conception of a communication system presented in
Section I, we consider a transmitter, channel, and receiver. The data
available at the receiver are used to infer the state of affairs at the
transmitter. We shall adopt the following terminology and notation:


General                        In Testing                   Notation

Possible states at the         Criterion classifications    x = a, b, c, ..., k
transmitter (alternatives
allowed)

Transmissions                  Persons                      S = A, B, C, ..., N

Possible states at the         Categories in which          y = ...
receiver (alternatives         results are reported
allowed)


Any person S has his proper or criterion classification x_s. The receiver at
any time has certain data y_s about S, and desires to infer x_s. y_s includes
whatever responses or scores describing S are available; when the tester
first approaches his problem he may have no differentiating information, in
which event y is the same for all subjects.


Using whatever data y_s are available, a judgment as to the proper x_s
will ordinarily be made with some degree of uncertainty. As more data are
obtained about S, the receiver hopes to increase the confidence with which
he can infer x_s. We may evaluate a data-gathering procedure by determining
how much it increases confidence, or reduces uncertainty. We might speak
of measuring changes in "degree of certainty", but this conflicts with the
dictionary usage which regards certainty as all-or-none. In common speech,
one does not refer to "degrees" of certainty. Therefore we shall use the
terms "degree of confidence" (or "confidence") and "degree of uncertainty"
(or "uncertainty") as antonyms. The precise relation between these two will
depend on the formulas used to define them, but in general we may think of
degree of uncertainty as "1.00 minus degree of confidence".

Degree  of  Confidence 

When the receiver infers that x_s = a, he is accepting a hypothesis. The
probability that this hypothesis is true we may call his confidence regarding
that hypothesis. We shall let Conf (a, y_s) denote the confidence that
x_s = a, for a person having a given y_s.

    \mathrm{Conf}(a, y_s) = \Pr\{x_s = a \mid y = y_s\}    (1)




Table 1.  Specimen Transition Matrix

                                Outputs (Responses) (y)

Inputs or Criterion
Classifications (x)       ...      y        ...      Total

        a                 ...    p_{y/a}    ...      1.00
        b                 ...    p_{y/b}    ...      1.00
        c                 ...    p_{y/c}    ...      1.00
        .
        .
        k                 ...    p_{y/k}    ...      1.00

      Pooled              ...    p_y        ...      1.00

We may regard p_{y/a} as the probability that y will be received when a is
transmitted. (If there is variable error in transmission, p_{y/a} will not be
1.00 for any y.) Similarly, p_{a/y} is the probability that this y belongs to
a population of responses received when a is transmitted many times.


The response characteristics of the communication channel define a set of
probabilities p_{y/x}. Table 1 shows a "transition matrix" giving these
values. These constitute a statement of the frequency and character of errors
associated with the test performance of persons falling in any criterion
class. Further, if we are dealing with a particular population of persons,
this population is described by a distribution of p_x over categories. When
p_x is known, we can determine the probabilities p_y of various responses:

    p_y = \sum_x p_{xy} = \sum_x p_x \, p_{y/x}    (2)

    p_{x/y} = \frac{p_{xy}}{p_y} = \frac{p_x \, p_{y/x}}{p_y}    (3)

These are the usual equations stating relations between joint and conditional
probabilities.
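
Relations (2) and (3) are easy to check numerically. The sketch below is an illustration only and is not taken from the report: the two categories, the "pass"/"fail" responses, and all probability values are hypothetical, and the function names `response_probabilities` and `posterior` are introduced here for convenience.

```python
def response_probabilities(p_x, p_y_given_x):
    """Equation (2): p_y = sum over x of p_x * p_{y/x}."""
    responses = next(iter(p_y_given_x.values())).keys()
    return {y: sum(p_x[x] * p_y_given_x[x][y] for x in p_x) for y in responses}

def posterior(p_x, p_y_given_x):
    """Equation (3): p_{x/y} = p_x * p_{y/x} / p_y, for every x and y."""
    p_y = response_probabilities(p_x, p_y_given_x)
    return {y: {x: p_x[x] * p_y_given_x[x][y] / p_y[y] for x in p_x}
            for y in p_y if p_y[y] > 0}

# Hypothetical population and transition matrix (each row sums to 1.00).
p_x = {"a": 0.25, "b": 0.75}
p_y_given_x = {"a": {"pass": 0.8, "fail": 0.2},
               "b": {"pass": 0.4, "fail": 0.6}}

p_y = response_probabilities(p_x, p_y_given_x)
p_x_given_y = posterior(p_x, p_y_given_x)
print(p_y)
print(p_x_given_y)
```

In this made-up case a "pass" raises the probability of category a from .25 to .40, and a "fail" lowers it to .10.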




It is well to note at this point that relations to be developed in this
Section are reversible. While we speak of a communication channel as having
a direction (transmitter-to-receiver), we do no mathematical violence if we
reverse the direction. This makes little sense in physical communication
problems, where we visualize a message passing along the channel as time
progresses, but there are occasions when it is of interest to view a process
in reverse. In testing, we shall at times be interested in inferring
(predicting) an obtained score or a response from a true score. A suitable
substitution of x for y in our formulas makes them appropriate for this
reversed inference.


Now, applying (2) and (3) to (1), we obtain the useful computing formula

    \mathrm{Conf}(a, y_s) = \frac{p_a \, p_{y_s/a}}{p_{y_s}} = \frac{p_a \, p_{y_s/a}}{\sum_x p_x \, p_{y_s/x}}    (4)

Shannon points to the desirability of measuring the confidence (or
uncertainty) of a receiver averaged over an entire set of inferences.


The standard item. Shannon introduces what we may call the "standard
message" as a device for expressing degree of uncertainty. The unit or
standard for measuring information is the so-called bit, and a standard
message is one which conveys "one bit of information". In thinking of testing
problems, we might think of the standard message as a "standard item". While
the standard binary message is ordinarily described as a message which divides
persons into two equal categories with no error, we shall employ the
following, slightly more general, definition. A standard item is one for
which p_{x/y} / p_x is 2 or 0, for each x and y. That is to say, if a report
from a standard item is added to whatever prior information we have, we can
reject some hypotheses as having zero probability; and the probability of any
other hypothesis being true is doubled over what it was when inference was
made without the information given by the item. From (3), it follows that
p_{y/x} / p_y also is 2 or 0, for a standard item.


Table 2.  Illustrative Transition Matrix for a Standard Item

(Cell entries show p_{y/x}, which for a standard item is 2p_y or 0)

                        Outputs (y)

Inputs (x)                                    Total

    a          2p_y      0        ...         1.00
    b          2p_y      0        ...         1.00
    g          2p_y      2p_y     ...         1.00
    h          2p_y      2p_y     ...         1.00
    i          0         2p_y     ...         1.00
    .
    .
    m          0         2p_y     ...         1.00
    n          0         2p_y     ...         1.00
    k          0         0        ...         1.00


In Table 2, a transition matrix for a standard item is presented, showing
the probability that any y will arise, for a given x.
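
The defining property of a standard item (every ratio p_{y/x} / p_y equal to 2 or 0) can be tested directly on any transition matrix. The following sketch is offered as an illustration, not as part of the original report; the function name `is_standard_item`, the tolerance argument, and the three small matrices are all hypothetical.

```python
def is_standard_item(p_x, p_y_given_x, tol=1e-9):
    """Return True if p_{y/x} / p_y is 2 or 0 for every x and y."""
    responses = next(iter(p_y_given_x.values())).keys()
    for y in responses:
        # Pooled probability of response y, as in equation (2).
        p_y = sum(p_x[x] * p_y_given_x[x][y] for x in p_x)
        if p_y == 0:
            continue  # response never occurs; nothing to check
        for x in p_x:
            ratio = p_y_given_x[x][y] / p_y
            if abs(ratio) > tol and abs(ratio - 2) > tol:
                return False
    return True

# Error-free item dividing four equally likely categories into two halves.
halves = ({"a": .25, "b": .25, "c": .25, "d": .25},
          {"a": {"+": 1.0, "-": 0.0}, "b": {"+": 1.0, "-": 0.0},
           "c": {"+": 0.0, "-": 1.0}, "d": {"+": 0.0, "-": 1.0}})

# The slightly more general case: category a may give either of two responses.
general = ({"a": .5, "b": .5},
           {"a": {"u": .5, "v": .5, "w": 0.0},
            "b": {"u": 0.0, "v": 0.0, "w": 1.0}})

# An item with error of the ordinary kind is not a standard item.
noisy = ({"a": .5, "b": .5},
         {"a": {"+": .8, "-": .2}, "b": {"+": .2, "-": .8}})

print(is_standard_item(*halves), is_standard_item(*general),
      is_standard_item(*noisy))
```

The second matrix shows why the definition is more general than "two equal categories with no error": the responses u and v each have pooled probability .25, yet every nonzero cell is still exactly twice its column's pooled probability.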

Standard Items Required to Reach Certainty

Shannon's measure of uncertainty can be shown to be the number of standard
items per individual required to provide complete certainty regarding all
individuals. In the general communication case, it is the average number of
standard (binary) symbols that must be received to provide the receiver
complete certainty as to the transmitted message.


Suppose a series of standard items is administered to a population of
individuals, each item having the same type of transition matrix and each
independent of the others. Independence must be defined by the following
conditions (where 1 and 2 designate any two items). For any y and x,

    p_{y_1 y_2 / x} = p_{y_1/x} \, p_{y_2/x}
                                                    (5)
    p_{y_1 y_2} = p_{y_1} \, p_{y_2}

This is to say, the items are uncorrelated, and uncorrelated with x held
constant. Each item measures a separate portion or aspect of the criterion,
as in that sort of test where we seek to make correlation between item and
criterion positive, and correlation between items zero.


For any person, the series of items generates a series of responses which
we denote y_t. By definition (1), the confidence in classifying S as an x
after t items ("t messages received") is

    \mathrm{Conf}(x, y_t) = p_{x/y_t}    (6)

                          = \frac{p_x \, p_{y_t/x}}{p_{y_t}}    (7)

From (5),

    \mathrm{Conf}(x, y_t) = \frac{p_x \, p_{y_1/x} \, p_{y_2/x} \cdots}{p_{y_1} \, p_{y_2} \cdots}    (8)

The ratios p_{y_1/x} / p_{y_1}, etc. can be 2 or 0. Each response must have
non-zero probability in order to have arisen from the given x. Therefore

    \mathrm{Conf}(x, y_t) = p_x \, (2)^t    (9)

This follows from the definition of a standard item.
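
The doubling of confidence stated in (9) can be traced step by step in a small case. The sketch below is an assumption-laden illustration rather than anything given in the report: it supposes eight equally likely categories labeled by three binary digits, and three error-free items each reporting one digit, which satisfy the independence conditions (5). The names `update`, `true_x`, and `history` are hypothetical.

```python
from itertools import product

# Hypothetical population: eight equally likely categories, labeled "000".."111".
categories = ["".join(bits) for bits in product("01", repeat=3)]
prior = {x: 1 / 8 for x in categories}

def update(conf, item, response):
    """Revise confidence after an error-free item reporting digit `item` of x."""
    likelihood = {x: 1.0 if x[item] == response else 0.0 for x in conf}
    p_response = sum(conf[x] * likelihood[x] for x in conf)
    return {x: conf[x] * likelihood[x] / p_response for x in conf}

true_x = "101"
conf = dict(prior)
history = [conf[true_x]]          # confidence before any item (t = 0)
for item in range(3):
    conf = update(conf, item, true_x[item])
    history.append(conf[true_x])  # confidence after t = item + 1 items

print(history)  # [0.125, 0.25, 0.5, 1.0], i.e. p_x * 2**t for t = 0, 1, 2, 3
```

Each item rejects half of the remaining hypotheses and doubles the confidence in the rest, so certainty is reached after three items, which is -log2 of the prior probability 1/8.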


Now our question is, what value of t is required to make the receiver
completely certain? When x = a, we may denote by t_a the number of standard
items required to make Conf (a, y_t) = 1.

    1 = p_a \cdot 2^{t_a}    (10)

    0 = \log p_a + t_a \log 2    (11)

As Shannon points out (26, p. 4), the choice of base for the logarithm here is
completely arbitrary. It does not affect conclusions from the theory, and
may be regarded as a convention. However, if we choose base 2 for the
logarithm, our equation simplifies to

    t_a = -\log_2 p_a    (12)


If information is transmitted about one person at a time, we of course
have no way of transmitting (say) 2.3 standard items. If - log p_a = 2.3,
we would have to transmit 3 items to classify a person as a with certainty.
In general, the number of items required to classify an individual is the
integer next larger than - log p_a (unless - log p_a is an integer). That is,
when information is transmitted for just one person in any item,

    -\log p_a \le t_a < 1 - \log p_a

In a sample where w_x is the proportion of persons in any x category, we
may let \bar{t} signify the average number of items required per person.

    -\sum_x w_x \log p_x \le \bar{t} < 1 - \sum_x w_x \log p_x    (13)

As N becomes very large the number of persons for whom x is the true
classification approaches N p_x. Therefore,

    -\sum_x p_x \log p_x \le \bar{t} < 1 - \sum_x p_x \log p_x    (14)
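
The rounding-up argument and the bounds in (14) may be illustrated numerically. The following is a minimal sketch under stated assumptions: the three-category distribution is invented for the example, logarithms are taken to base 2, and the function names `items_required` and `average_items_bounds` are introduced here and do not come from the report.

```python
import math

def items_required(p_a):
    """Smallest whole number of standard items t_a with t_a >= -log2(p_a)."""
    return math.ceil(-math.log2(p_a))

def average_items_bounds(p):
    """Bounds of (14): H <= t_bar < H + 1, with H = -sum of p_x * log2(p_x)."""
    h = -sum(px * math.log2(px) for px in p.values() if px > 0)
    return h, h + 1

# The text's example: -log p_a = 2.3 calls for 3 whole items.
print(items_required(2 ** -2.3))

# Hypothetical distribution over three categories.
p = {"a": 0.2, "b": 0.3, "c": 0.5}
t = {x: items_required(px) for x, px in p.items()}
t_bar = sum(p[x] * t[x] for x in p)
lower, upper = average_items_bounds(p)
print(t, t_bar, lower, upper)
```

For this distribution the whole-number requirements are 3, 2, and 1 items, and their weighted average falls between the two bounds, as (14) requires.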

It is of interest to note that, for any value of a, there is a limited
number of items that can simultaneously satisfy the independence conditions
(5). Not more than - log p_a such items can be found.


Interpretation as sequential analysis. Obtaining certainty by means of
a test may be thought of as a problem in sequential analysis. Suppose items
are administered to a person, one at a time, until an accurate inference can
be made regarding his true category. We would continue testing any person
until some desired level of confidence is reached, at which point we would
make a decision regarding him and proceed to the next person (41). We would
probably have to administer more items to some persons than others. In
quality control a testing procedure is evaluated by determining the average
sample number; it is the average number of objects which must be tested in
order to arrive at decisions about the population from which the objects are
sampled. In the communication problem, \bar{t} is the average sample number;
i.e., the average number of standard messages required to reach a desired
degree of confidence for members of an ensemble of messages.

If we are willing to discontinue testing at some confidence level less
than 1.00 (say C'), then the required number of items t' is a simple function
of t_a.

    C' = p_a \cdot 2^{t'_a}    (15)

    t'_a = t_a + \log C'    (16)

(t'_a, in this formula, may not be an integer.) If C' is the same for all
categories,

    \bar{t}' = \bar{t} + \log C'    (17)

This demonstrates that an additive correction makes \bar{t} a statement of the
number of standard items required to classify persons at any desired
confidence level. We might require different confidence levels for different
categories, larger C' attending the more important categories where we want to
minimize risk of error. If each x has some C'_x,

    \bar{t}' = \bar{t} + \sum_x p_x \log C'_x    (18)
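
The additive correction in (16) and (18) can be verified with a short computation. This sketch is illustrative only: the probabilities and confidence levels are hypothetical, logarithms are to base 2, \bar{t} is taken in its non-integer form -Σ p_x log p_x, and the function names are assumptions introduced for the example.

```python
import math

def items_for_confidence(p_a, c):
    """Equation (16): t'_a = t_a + log2(C'), where t_a = -log2(p_a)."""
    return -math.log2(p_a) + math.log2(c)

def average_items_for_confidence(p, c):
    """Equation (18): t_bar' = t_bar + sum of p_x * log2(C'_x)."""
    t_bar = -sum(px * math.log2(px) for px in p.values())
    return t_bar + sum(p[x] * math.log2(c[x]) for x in p)

# A category with prior 1/16, tested only until confidence reaches .50.
t_prime = items_for_confidence(1 / 16, 0.5)
print(t_prime, (1 / 16) * 2 ** t_prime)   # the second value recovers C' by (15)

# Hypothetical case with a stricter level for the more important category a.
p = {"a": 0.5, "b": 0.25, "c": 0.25}
c = {"a": 1.0, "b": 0.5, "c": 0.5}
print(average_items_for_confidence(p, c))
```

Because log C' is negative whenever C' is below 1.00, the correction always shortens the test; accepting a confidence of .50 saves exactly one standard item.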

Interpretation in terms of coding. Shannon uses information theory to
develop theorems regarding coding, and it is in that context that his formulas
have particular relevance. Any message to be transmitted is encoded to fit a
particular channel. Only a one-to-one encoding is involved in the typewriter,
where the "channel" has one symbol for each letter. In other channels, such
as telegraph, telephone, or teletypewriter, the transmitted symbol is encoded
into dots-and-dashes, vibrations, or other new symbols. In order to measure
the transmission capacity and rate of systems using various symbol systems,
Shannon evaluates each one by determining how each one compares with a
noise-free binary transmission system.

A binary system has two states: "on-off", "push-pull", etc. The two states
may be represented by 1 and 0; the binary system transmits messages of the
form 101, or 00011, or 1101. The electronic computer uses a binary system of
this type. If, a priori, 1 and 0 are equally likely at a given instant, the
transmission of either symbol without error ("noise-free" case) doubles the
confidence of the receiver. A standard message (i.e., one bit) has been
transmitted.




Any complex message can be encoded in binary form. We may agree, for
instance, that "red-even" will be transmitted as "11," or that the letter "g"
will be represented by 11001. Any set of sixteen equally likely alternatives
can be encoded perfectly into four binary symbols (e.g., 1101). In general,
2^t equally probable messages can be encoded into messages of t binary digits.

If the transmissions are not equally likely, it is economical to use
shorter codes for the most common messages. Thus, consider how we might
encode four alternatives, knowing their probabilities of occurrence.


Alternative     p_x      Code     No. of digits     Weighted digits

     A          .50      0              1                .50
     B          .25      10             2                .50
     C          .125     110            3                .375
     D          .125     111            3                .375

Weighted avg. digits per letter                          1.75

The message ACB would be encoded 011010. Each possible code has a unique
interpretation. The transmission 1100100101110 is equivalent to CABABDA.
Thus, in thirteen standard messages we convey seven letters, using an average
of 1.86 bits per letter. If we had coded each alternative in a two-digit
pattern, our average would be 2.0. By using fewer digits for A than for C
and D, we have a relatively efficient code. Over a longer series of
transmissions in which each letter appeared in the specified proportions p_A,
etc., the average number of bits required per letter would drop to exactly
-\sum p \log p (1.75).
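
The four-letter code of the preceding table can be exercised directly. The sketch below is an illustration, not part of the report; the function names `encode` and `decode` are introduced here, and the decoder relies on the fact that no code word in this table is the beginning of another.

```python
# Code and probabilities as given in the table for alternatives A, B, C, D.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

def encode(message):
    """Replace each letter by its binary code word."""
    return "".join(code[letter] for letter in message)

def decode(bits):
    """Read digits left to right, emitting a letter whenever a code word is complete."""
    inverse = {word: letter for letter, word in code.items()}
    letters, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            letters.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing digits do not form a complete code word")
    return "".join(letters)

print(encode("ACB"))                      # 011010
print(decode("1100100101110"))            # CABABDA
print(len("1100100101110") / 7)           # digits per letter in this message
print(sum(p[x] * len(code[x]) for x in p))  # 1.75
```

The thirteen-digit transmission decodes to seven letters, about 1.86 digits per letter, while the long-run weighted average for the stated proportions is 1.75.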

Coding of one letter at a time will not be so efficient as this if the
p_x are not integral powers of 0.5 as in the foregoing example. It is
possible, however, to encode patterns of letters, and this permits economy of
message space. For example, compare single-letter and two-letter-pattern
coding, when the p_x for two alternatives are .64 and .36.


                         Best single                   Weighted
Alternatives     p       letter code      Digits       digits

     A          .64          1              1            .64
     B          .36          0              1            .36

Weighted avg. digits per letter                         1.00


Suppose A and B are independent, so that the sequence AB has probability
p_A p_B, etc. Then we may consider sequences of two letters.


Alternative                                           Weighted
sequence         p       Best code      Digits        digits

    AA          .41         0             1             .41
    AB          .23         10            2             .46
    BA          .23         110           3             .69
    BB          .13         111           3             .39

Sum                                                    1.95
Digits per pattern transmitted                         1.95
Digits per letter transmitted                           .975

Coding of two-letter sequences permits a saving of 2.5% in average message
length. There is still some inefficiency, however. Reporting by the first
digit that the sequence is AA or not AA conveys less than one bit of
information because p_AA is not equal to .50. By building longer sequences, we
can arrive at sequences whose probabilities are equal to powers of 0.5, to any
desired degree of approximation.

Shannon demonstrates that if indefinitely long sequences may be encoded,
it is possible to transmit messages in exactly -\sum p \log p standard items
per symbol, on the average. This minimum number of items is his measure of
uncertainty, in terms of message space required to produce certainty. He
calls this H_x.

    H_x = -\sum_x p_x \log p_x    (19)

For the probabilities given above, H_x is .943 bits per letter.
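
The figures quoted for the two-alternative example can be recomputed from (19). This is a sketch for illustration and is not taken from the report; the names `entropy`, `pair_code`, and `pair_p` are introduced here, base-2 logarithms are assumed, and unrounded pair probabilities are used, so the digits per pattern come to about 1.9504 rather than the table's rounded 1.95.

```python
import math
from itertools import product

def entropy(p):
    """Equation (19): H = -sum of p_x * log2(p_x), in bits per symbol."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

p = {"A": 0.64, "B": 0.36}

# Two-letter patterns, assuming successive letters are independent.
pair_code = {"AA": "0", "AB": "10", "BA": "110", "BB": "111"}
pair_p = {a + b: p[a] * p[b] for a, b in product(p, repeat=2)}

digits_per_pattern = sum(pair_p[s] * len(pair_code[s]) for s in pair_p)
digits_per_letter = digits_per_pattern / 2
h = entropy(p)

print(round(h, 3), round(digits_per_pattern, 4), round(digits_per_letter, 4))
print(entropy({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
```

The comparison runs in the expected order: one digit per letter for single-letter coding, about .975 for two-letter patterns, and .943 as the limit that no code can improve upon. For the earlier four-letter example the same function returns 1.75, the value already attained by the single-letter code.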

The testing problem is not truly comparable to Shannon's encoding problem
for which he uses (19) and derivative formulas. Given a person who belongs
in a category having probability .25, it is theoretically possible to
find two independent items, each dividing the group in half, so that their
combined information identifies the category the person belongs to. We have
essentially "encoded" the information in terms of responses to two items.
Actual test items contain error, but this can be taken into account.

What is not reasonable in testing is to think of encoding two or more
persons, i.e., a sequence of persons; it is not possible to "send two or more
persons through the channel" at once. Since such encoding of patterns is not
possible, formula (19) is only approximately a statement of the number of
items required to classify a person on the average. Formula (14) holds
precisely, but is not manageable. We shall employ (19) as a basis for subsequent
formulas, therefore, with the warning that these results are interpretable
only approximately. Since H + 1 > t > H, results from subsequent formulas may
be in error by as much as one bit. No exact statement may be made regarding
the error introduced into our ratio formulas by this approximation. The
failure of Shannon's formula to have an exact meaning for the testing problem
means that his theorems regarding capacity and encoding cannot be regarded as
necessarily true or meaningful in psychometrics.




Interpretation as log confidence. Another interpretation of Shannon's
measure results from considering a long sequence of transmitted symbols. Any
series of transmissions may be regarded as one single message. This sequence
has a particular frequency of occurrence. If the elements in the message are
independent as defined earlier, we can compute this probability easily.

Let the sequence v contain n_i elements of type i.* (In testing, we
would speak of a sample containing n_i persons in each category.) Then if p_v
is the probability of the sequence v occurring,

p_v = \prod_i (p_i)^{n_i}    (20)

Obviously, p_v is our confidence, in the absence of received information, that
a particular sequence will be v. That is, p_v is our a priori confidence in
the hypothesis x = v. Then, writing w_i = n_i / N for the proportion of
elements of type i,

\log p_v = \sum_i n_i \log p_i = N \sum_i w_i \log p_i    (21)


As N becomes very large, w_i \to p_i and, under the requirement of independence,

\log p_v \to N \sum_i p_i \log p_i    (22)

H_i = -\frac{1}{N} \log_2 p_v    (23)


As N \to \infty the probability of all sequences becomes the same and for any
sequence approaches 2^{-NH}, where H is defined by (19). p_v is our a priori
confidence that we can infer the sequence, and -NH is log p_v. Hence, from
(12), if we can encode a whole sequence at once, NH tells us the number of
standard items required to convey the sequence. 2^{-NH} expresses the
a priori confidence. This interpretation is not especially useful for
testing, however, because we cannot "encode" sequences of persons.


Finally, we may note that

p_v = \prod_i p_i^{N p_i}    (24)

and from (23)

H_i = -\log \prod_i p_i^{p_i}    (25)

That is to say, -H is the log geometric mean of the p_i.
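Relations (20) through (25) can be illustrated numerically. The sketch below is an illustration only, under the assumption that the sequence contains each type in exactly its expected proportion (n_i = N p_i); the probabilities are those of the coding example and N = 100 is an arbitrary choice.

```python
import math

p = [0.64, 0.36]          # probabilities of the elementary inputs
N = 100                   # assumed sequence length
n = [64, 36]              # counts n_i = N * p_i for a "typical" sequence

# Formula (19): uncertainty per symbol, in bits.
H = -sum(q * math.log2(q) for q in p)

# Formulas (20)-(21): log probability of one particular typical sequence.
log2_p_v = sum(k * math.log2(q) for k, q in zip(n, p))

# Formula (25): -H is the log of the (probability-weighted) geometric mean.
geometric_mean = math.prod(q ** q for q in p)

print(f"H                    = {H:.4f}")
print(f"-log2(p_v) / N       = {-log2_p_v / N:.4f}")
print(f"-log2(geometric mean)= {-math.log2(geometric_mean):.4f}")
```

All three printed quantities agree, which is the content of (23) and (25): the log confidence in a typical sequence is -NH.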


Cautions in employing H to measure uncertainty. The custom has arisen
of referring to H as a measure of uncertainty. We can indeed interpret H or
t as statements of the amount of information lacking. In these formulas,
however, that information is expressed in terms of the number of standard
items required to attain perfect confidence. Thus H and t are expressed on
a scale of message space.


In evaluating a test, we might consider applying formulas based on H or t
to assess uncertainty after testing. This measure tells us how much message
space is required to eliminate residual uncertainty under certain conditions.


* The symbol i is used for the separate elementary inputs which constitute
the complex input v.




The formula for t assumes that the test is being given sequentially, each
person being given just the number of items needed to classify him at the
desired confidence level. The formula for H assumes, in Shannon's words,
that delay at the transmitter is possible; i.e., that many inputs (persons)
can be sent through the channel at once. This has no meaning for the tester.
While it is probably true that a test which reduces uncertainty as measured
by t will also show similar effectiveness (relative to other tests) when
evaluated by any other formula, we can give no exact interpretation to the
formulas of this section except in terms of sequential testing.


Since the scale of message space is logarithmically related to the
confidence scale, equal differences in message space do not represent equal
units of gain. In terms of the gain from giving the test, the confidence
scale is more directly interpretable than the log confidence (t) scale.
Suppose we are hiring men who fall into two classes, S and U. For
simplicity assume that every S man hired is worth $1,000 to the employer, and
every U is worth $0. Assume further that among applicants p_S = .125.

A priori confidence is .125 (H = 3) and in N decisions the company gains
$125N. Now Test 1 selects applicants 25% of whom are S. Confidence at this
stage is .25, H = 2; and the value of the men hired is $250N. Test 2, given
in place of Test 1, identifies men 50% of whom will be S. Then confidence
becomes .50, H = 1, and the value of decisions is $500N. The gain in
"message space" is one bit from Test 1, and two bits from Test 2; but the
dollar gain per man hired is $125 from Test 1, $375 from Test 2. It is clear
that Test 2 is more than twice as useful to the tester as Test 1.
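The arithmetic of the hiring example can be laid out side by side. This is a small sketch for illustration; the dictionary keys and variable names are assumptions, while the confidence levels and the $1,000 value of an S man are taken from the text.

```python
import math

value_of_S = 1000  # dollars per S man hired; a U man is worth $0

# Confidence that a hired man is S, under each condition.
confidence = {"no test": 0.125, "Test 1": 0.25, "Test 2": 0.50}

# Message-space scale: bits still needed to reach certainty.
bits_needed = {k: -math.log2(c) for k, c in confidence.items()}

# Utility scale: expected dollar value per man hired.
dollars = {k: value_of_S * c for k, c in confidence.items()}

bit_gain = {k: bits_needed["no test"] - bits_needed[k]
            for k in ("Test 1", "Test 2")}
dollar_gain = {k: dollars[k] - dollars["no test"]
               for k in ("Test 1", "Test 2")}

for test in ("Test 1", "Test 2"):
    print(f"{test}: gain of {bit_gain[test]:.0f} bit(s), "
          f"${dollar_gain[test]:.0f} per man hired")
```

The bit gains stand in the ratio 2 to 1, the dollar gains in the ratio 3 to 1, which is why the H scale understates the advantage of Test 2.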


Because the H scale is not linearly related to practical gains, we shall
turn in a later section to utility measures. In the remainder of the present
section, we present developments we have arrived at through the Shannon model.
The concepts involved are thought-provoking. The formulas themselves are not
the most suitable basis for evaluating a test, and are ultimately to be
replaced by utility measures representing the same concepts.


Information  as  the  Reduction  of  Uncertainty 


We conceive of a continuum from great uncertainty to complete certainty.
At any time our degree of uncertainty can be located on this continuum in
terms of H or some other uncertainty measure. H_x represents a priori
uncertainty as we begin testing. After any stage of testing we have a residual
uncertainty which is equal to or greater than zero.

Measure  of  Residual  Uncertainty  after  Testing 

We may assess this residual uncertainty by the same logic used to define
H_x. After an item or series of items has been administered, each person is
placed in one of many possible y categories, depending on his response (or
configuration of responses).

The confidence with which we can classify a person whose response y is
known has already been defined by equation (1) as p_{x/y}. Now what is the
average number of standard items required to place persons falling in
the y category a into the proper x category with complete certainty? We can
insert conditional probabilities in equation (19).




Figure 2. The confidence continuum viewed in terms of message space.
(Scale: number of independent tests per person. Marked points: H_0, H_1,
H' at the accepted stopping point, and 0 at complete certainty.)


H_{x/a} = -\sum_x p_{x/a} \log p_{x/a}    (26)


If H_{x/y} represents the number of items required, averaged over all response
categories,

H_{x/y} = -\sum_y \sum_x p_y p_{x/y} \log p_{x/y}    (27)


This is the residual uncertainty after administering whatever tests yielded
responses y. Shannon calls this "equivocation," denoting it by H_y(x). It is
analogous to the standard error of estimate of conventional test theory.


We will find it useful hereafter to refer to the uncertainty at various
stages of testing as H_0, H_1, .... We shall ordinarily use H_0 for the
uncertainty before testing, and H_1 for the uncertainty after administering a
particular test. Subscripts can be defined in the context of any discussion.


The  Confidence  Continuum  and  the  Measure  R 


The test moves us along the continuum of "information needed" from the
point H_0 to the point H_1 (see Figure 2). Before testing, we required H_0
standard items per person to become certain; after testing, we need only H_1
such items. It is obvious that the test, then, gave us information equivalent
in message space to H_0 - H_1 standard items. It is this difference
which Shannon calls R, the "rate of transmission of information."

R = H_0 - H_1 = -\sum_x p_x \log p_x - ( -\sum_y \sum_x p_y p_{x/y} \log p_{x/y} )    (28)

Shannon demonstrates these useful identities:

R = -\sum_y p_y \log p_y - ( -\sum_x \sum_y p_x p_{y/x} \log p_{y/x} )    (29)

R = \sum_x \sum_y p_{xy} \log p_{xy} - \sum_y p_y \log p_y - \sum_x p_x \log p_x    (30)

A mnemonic device for these relations is Miller's logical diagram reproduced
in Figure 3. In the notation of the diagram, equations (28) - (30) become

R = H_x - H_{x/y} = H_y - H_{y/x} = -H_{xy} + H_x + H_y    (31)
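That the three expressions for R agree can be verified on any joint distribution. The following sketch uses a made-up 3 x 2 table of joint probabilities p_xy (an assumption for illustration, not data from the report) and computes the equivocation of (27) along the way.

```python
import math

def H(probs):
    """Uncertainty in bits: -sum p log2 p, skipping zero cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities p_xy; rows are x categories, columns are y.
p_xy = [[0.30, 0.10],
        [0.05, 0.25],
        [0.10, 0.20]]

p_x = [sum(row) for row in p_xy]            # marginal distribution of x
p_y = [sum(col) for col in zip(*p_xy)]      # marginal distribution of y

H_x, H_y = H(p_x), H(p_y)
H_xy = H(p for row in p_xy for p in row)    # joint uncertainty

# Formula (27): equivocation, the residual uncertainty about x given y.
H_x_given_y = sum(
    p_y[j] * H(p_xy[i][j] / p_y[j] for i in range(len(p_x)))
    for j in range(len(p_y))
)
# The same quantity in the reverse direction.
H_y_given_x = sum(
    p_x[i] * H(p_xy[i][j] / p_x[i] for j in range(len(p_y)))
    for i in range(len(p_x))
)

R_28 = H_x - H_x_given_y        # formula (28)
R_29 = H_y - H_y_given_x        # formula (29)
R_30 = H_x + H_y - H_xy         # formula (30), rearranged as in (31)

print(f"R by (28) = {R_28:.4f}, by (29) = {R_29:.4f}, by (30) = {R_30:.4f}")
```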




Figure 3. Relation between various sorts of variation or uncertainty
(after Miller). (Legible labels: total variation in system; transmitted
information; variation in responses (output).)


R represents the change in uncertainty as a result of testing,
expressing the difference in terms of the number of standard items required
after testing, compared to the number needed before testing. Thus, in a
sense, R states how many standard items our test is equivalent to.


Exhaustiveness and Dependability


While R is a measure of the discriminating power of the item (or test),
we may also be interested in the question, "How many independent items like
this would be required to classify all persons with the desired degree of
confidence?" If we have a series of items having the same transition matrix,
but independent as specified earlier, then the average number of items of this
type required for certainty is n, where

n = \frac{H_0}{R} = \frac{H_0}{H_0 - H_1}    (32)

This index is the "average sample number" expressed in terms of independent
items all having the same R as the item under consideration. If we are
willing to discontinue testing when certainty reaches C,

n' = \frac{H_0 - H'}{H_0 - H_1}    (33)

where H' = -\log C. This is a more general version of (32).


The Index of Exhaustiveness


The reciprocal of n' is a measure of exhaustiveness, which we shall
designate J_{x/y}.

J_{x/y} = \frac{R}{H_0 - H'} = \frac{H_0 - H_1}{H_0 - H'} = \frac{Information obtained}{Information desired}    (34)


Here the "information desired" consists of the x classifications at some
specified level of confidence. As before, "information" is measured in terms
of message space. This index of exhaustiveness, ordinarily with H' as zero,
has been used by some followers of Shannon as a measure of fidelity of
transmission (e.g., 23).

The  Index  of  Dependability 


Whereas J_{x/y} expresses the extent to which the information reported in y
permits determination of x, it is possible to reverse the process. We might
write a comparable ratio to specify the degree to which x determines y. We
shall use the symbol K for this ratio, although it would be equally logical
to use J_{y/x}.

The distribution of obtained data ("output") constitutes a series of
reported individual differences. Some of these statements are error,
introduced by "noise" in the channel. In a noise-free channel, all the individual
differences in output would be determined by the transmitted signal, i.e., by
the criterion information. A statement of the extent to which the output
information is relevant to the criterion, then, is obtained if we measure the
amount of information in the responses y, and then find out what fraction of
that information is R.




The output uncertainty H_y is defined just as H_x is, in terms of the
number of standard messages required to convey this information.

H_y = -\sum_y p_y \log p_y    (35)


We may think of H_y as representing the amount of differentiation the
test claims to make, and R as the amount that is criterion-relevant.
Perhaps "relevance" (cf. Cureton, 12, p. 624) is a better term for the ratio
of these than "dependability," but for the present we will employ the latter
term as in our earlier report. The extent to which output variation is
determined by the input is expressed in the dependability ratio K:

K = \frac{R}{H_y} = \frac{Information obtained}{Information required to specify output}    (36)


As in (34), we could introduce H' into the denominator of (36) to allow for
any desired degree of specification of the test score.
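The indices defined so far can be computed together from a table of joint probabilities. The sketch below is illustrative only: the function name `information_indices` and the example tables are assumptions, H' is taken as zero, and R is obtained from the identity (31).

```python
import math

def entropy(probs):
    """Uncertainty in bits: -sum p log2 p, skipping zero cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_indices(p_xy):
    """Return R, J (exhaustiveness), K (dependability) and n for a joint
    table whose rows are criterion categories x and columns are responses y.
    H' is taken as zero, so J = R / H_x; K = R / H_y; n = H_x / R."""
    p_x = [sum(row) for row in p_xy]
    p_y = [sum(col) for col in zip(*p_xy)]
    H_x, H_y = entropy(p_x), entropy(p_y)
    H_xy = entropy(p for row in p_xy for p in row)
    R = H_x + H_y - H_xy
    return {"R": R, "J": R / H_x, "K": R / H_y, "n": H_x / R}

# Hypothetical noisy test: 3 criterion categories, 2 response categories.
noisy = information_indices([[0.30, 0.10],
                             [0.05, 0.25],
                             [0.10, 0.20]])

# Hypothetical noise-free test: response identifies the criterion exactly.
perfect = information_indices([[0.5, 0.0],
                               [0.0, 0.5]])

print({k: round(v, 3) for k, v in noisy.items()})
print({k: round(v, 3) for k, v in perfect.items()})
```

For the noise-free table both ratios equal 1 and a single item suffices; for the noisy table J and K are both small, and n shows how many such independent items would be needed on the average.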


In our earlier report, Quastler developed the diagram
reproduced here as Figure 4. In Quastler's notation, cr represents the
criterion x, and a second symbol represents the test output y. I is a
substitute for R, and the other changes in notation should be obvious. It is
evident from this diagram that R tells how much the test and criterion have
in common (as measured in terms of standard units). J and K, respectively,
tell what relation R bears to the total desired information H_x and to the
total reported information H_y.


Significance of the Formulas


Identifying three aspects of the information-carrying capacity of a test
in itself implies that it is insufficient to evaluate the test by a single
index such as R. It may at times be far preferable to have a test with
high K (for example, a pathognomonic sign which makes highly trustworthy
discriminations) even though the test does not answer many questions that
interest us and so leaves us with considerable residual uncertainty. At
other times, we might be content to have a very low degree of dependability,
provided that by sifting a large amount of this undependable information we
could predict the criterion more exhaustively. An example is observation of
a pupil's oral report to a class; this provides cues regarding his tensions,
postural habits, interests, articulation habits, knowledge, grammatical
ability, and so on. The cues are unreliable, but nonetheless useful as a
general survey of the pupil.


Figure 5 shows the various patterns which can occur. When the test and
criterion are defined so that the number of categories in each is about the
same, the diagrams in the top row are likely to apply. Whether H_x = H_y, of
course, depends on the frequencies in the categories. The top diagram shows
conditions found in reliability studies, or in comparisons of predicted
classification to actual classification. The second row represents the case
where the criterion divides people finely into many categories. This occurs,
for instance, where we desire precise measurements. It also occurs if the
criterion to be described is a complex configuration, many such
configurations being possible. The third row represents the relatively simple
criterion combined with complex (finely divided or multidimensional) test
information.

Figure 4. Exhaustiveness and dependability. Reproduced by permission
from (24).

Figure 5. Possible relations among H_x, H_y, R, J, and K. (Top row: test and
criterion differentiate equally finely.)

Under some circumstances, having a given H_x, it is wise to seek a test
which will have a similar H_y; but for other testing problems we might be
wiser to make H_y larger or smaller than H_x. In general, dependability is
more to be valued in making final, irreversible judgments. If judgments
are tentative and it is practical to reverse them on the basis of later
evidence, dependability can be sacrificed for exhaustiveness.

Responses to a typical test will have a very high uncertainty H_y. If
there are n items, each with a alternatives, then there are a^n possible
configurations of responses. The uncertainty is reduced if we score all
items as right or wrong, for there are then only 2^n configurations. Adding
scores into a total reduces the number of response categories still further
because variations in pattern are ignored. Each such successive reduction
reduces H_y. When H_y is reduced, either R or H_{y/x} is reduced, or both. That
is, the variation now being ignored may be criterion relevant or may be
irrelevant. Such scoring procedures as those mentioned above ordinarily
increase dependability, since it is unlikely that R is reduced more rapidly
than H_{y/x}. But so long as any of the variation eliminated is criterion
relevant, both R and J are reduced, perhaps to an important extent.
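The successive reductions in the number of response categories can be made concrete. The sketch below is illustrative; the function name and the choice of 10 items with 5 alternatives are assumptions. It reports the logarithm of the number of categories at each stage, which is the largest value H_y could take (reached only if the categories were equally probable).

```python
import math

def max_output_uncertainty(n_items, n_alternatives):
    """Upper bound on H_y in bits at three levels of scoring: the log of the
    number of distinguishable response categories at each level."""
    return {
        # every pattern of chosen alternatives is a separate category
        "full response patterns": n_items * math.log2(n_alternatives),
        # each item scored right or wrong: 2**n patterns
        "right/wrong patterns": float(n_items),
        # total score only: scores 0 through n
        "total score": math.log2(n_items + 1),
    }

bounds = max_output_uncertainty(n_items=10, n_alternatives=5)
for level, bits in bounds.items():
    print(f"{level:24s} at most {bits:6.2f} bits")
```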

An increasing interest is being shown in methods of drawing inferences
from test data which use more information per item. According to the
foregoing argument, any such more complex analysis should permit an increase in
validity. Perhaps the gain in validity may be too small to have practical
value, but some gain may be expected. Among the devices which promise to
use more of the information in the test responses are

(a) Considering which wrong alternative a person selects, when he
makes an error on an ability test

(b) Considering configurations of responses to various items

(c) Asking the person to mark more than one alternative per item,
as in the Troyer-Angell self-scorer or Coombs' recent proposal of directions
to "mark all wrong answers" (1, 9).

Non-symmetric validity relations. The introduction of J and K suggests
the importance of regarding validity relationships as possibly
non-symmetrical. A test which permits accurate prediction in one direction
(y to x) may be quite inaccurate in predicting in the reverse direction
(x to y). Suppose we have two types of classification, x and y, neither of
which is necessarily regarded as the criterion. For example, we might wish
to know whether a person's placement in certain interest categories predicts
his personality structure, or whether the information about his personality
predicts his interest classification. Either direction of interest is
legitimate. The joint distribution might be of the following form, where
the cell entry represents joint probability p_xy.


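(Table: joint distribution of interest category (y, rows) by personality
structure (x, columns A, B, C, D, E). Each cell is either 0 or a nonzero
entry X; the individual cell entries are not legible in this copy.)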

In this relationship, we can predict interest classification with certainty
from the personality data, but the reverse inference cannot be made as
confidently. Interest information is determined by the personality data
(J_{y/x} = 1.00) but the interests do not specify the personality
(J_{x/y} < 1.00). Measures such as r and \chi^2 treat both directions of
inference symmetrically. In contrast, non-symmetric measures like J and K
(or the curvilinear correlation eta for ordered categories) treat the
inferences separately.
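A small numerical case shows the asymmetry. The table below is hypothetical, not the one printed in the report: five equally likely personality structures, each falling in exactly one of three interest categories. The names are illustrative assumptions, and H' is taken as zero.

```python
import math

def entropy(probs):
    """Uncertainty in bits: -sum p log2 p, skipping zero cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities: rows are personality structures A-E (x),
# columns are interest categories a, b, c (y). One nonzero cell per row,
# so personality determines interest category.
p_xy = [[0.2, 0.0, 0.0],   # A -> a
        [0.2, 0.0, 0.0],   # B -> a
        [0.0, 0.2, 0.0],   # C -> b
        [0.0, 0.0, 0.2],   # D -> c
        [0.0, 0.0, 0.2]]   # E -> c

p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]
H_x, H_y = entropy(p_x), entropy(p_y)

# Residual uncertainty about y when x is known (zero in this table).
H_y_given_x = sum(px * entropy(q / px for q in row)
                  for px, row in zip(p_x, p_xy))
R = H_y - H_y_given_x

J_y_given_x = R / H_y   # how fully x determines y
J_x_given_y = R / H_x   # how fully y determines x

print(f"J(y/x) = {J_y_given_x:.2f}, J(x/y) = {J_x_given_y:.2f}")
```

Inference from personality to interest is complete, while inference from interest to personality recovers only about two-thirds of the needed information.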


The following discussion, repeated from our preliminary report, seems
an important consequence of the above argument. Perhaps most attempts at
qualitative assessment of performance or personality run afoul of
non-symmetric relations. Suppose a given test (or set of tests) offers c
possible configurations and there are C possible criterion configurations. When
C > c, prediction from test to criterion will be more hazardous than inferring
the test pattern from the criterion (i.e., postdiction). Now it might be
contended that a complex test affords a practically infinite number of possible

X’eSporioC  diiu  uiisu  oiic  nu:.iucr  oT  ouxuc'bux'co 


X 


^civ.Lx  ycj.  K/j  Q v.>x  U.V*..  w U4X  c 


rT44*V»  •nv^^r 
rr  X VX4  uxAjr 


infinite variety of situations to produce the criterion behavior, unless we
can specify the criterion situation in advance. Inferring personality
structure from behavior will probably always be more secure ("exhaustive")
than predicting behavior from structure.


An important proposal for test development follows from this. When a
test is intended for a specific selection task where there are only a few
clearly defined criterion categories (i.e., success or failure to attain a
T)T'of  i c1  enev  standa-rd  in  Ipanninr'  to  t.-irn»V.  firrii-rncv  of  r)rpdictlcn  from  test 


to criterion should be the first concern of the investigator. Since he is
likely to use multiple tests, H_y will exceed H_x and it will probably be
easier to infer from test to criterion than the reverse. This is obvious
with relation to the school-grade criterion, where it is easier to predict
that a student will fail, knowing test scores and the curriculum to be
studied, than to tell what requisite he was weak in, knowing only that he
failed.


On the contrary, when the criterion is highly complex, as in the case
where one can achieve "success" in a variety of ways, or where the situation
in which success is demonstrated varies, it will generally be easier to
"postdict" from performance to the test. Until preliminary research has
developed a sound basis for making postdictions, it is hopeless to try to use
the test predictively. If the complex criterion behaviors are pooled
(perhaps by some rater) into a single index of success, unless we know the basis
on which the complex behavior is implicitly evaluated and weighted, it is
hopeless to try to predict the simplified criterion.




To determine which tests are promising for further study and development,
postdiction research should often be the first attempt at validation. This
applies especially to personality tests. To implement this suggestion
requires an adequately complex criterion which provides information about the
situational pressures and demands under which the person works. Then, unless
enough cases are available for likelihood-ratio analysis or the like, the
assessor would make "postdictions" using his theory regarding the tested
behavior. He might report "In view of the criterion behavior of this person
I expect him to show good performance on test A but poor performance on
test B." Such inferences can be validated against the test record. Interest
of course centers on J_{test/criterion}. This inference will be
difficult, but far less difficult than the task typically posed to assessors in
studies like that of Kelly and Fiske (20). Their validity is judged by
J_{criterion/test}, though the criterion is derived in unknown ways from a
very complex situation such as the performance of a clinician in internship.


Naturally, when the end-goal of test development is prediction, one
does not cease research when a test has shown postdictive validity. Our
proposal is only that research should employ hypotheses wherever possible that
can be verified, and that in fields where prediction is difficult because
H_x > H_y, formal postdiction studies would be the appropriate first line
of attack.


Relations Involving a Fallible Criterion


We may consider a more complex case which shows how error of measurement
may be taken into account in assessing validity. Quite often, the available
transition matrix of p_{y/x} is not based upon the desired or criterion
information. Instead, a fallible criterion x is used, and we desire to predict
some true criterion \chi. x is viewed as generated from \chi. It may be possible
to estimate (or hypothesize) some rate of error in generating x from \chi.
Then it should be possible to investigate how much information y gives
about \chi, and what fraction of the desired information is obtained.


The problems of inference involving three variables would require an
extensive digression. The interested reader should consult McGill (21). We
shall sketch very superficially the concepts for this analysis. The
correction for criterion fallibility is a fairly simple case of multivariate
information analysis.

If we adapt the Miller diagram to three variates we can sketch Figure 6.
A large number of identities can be constructed by the reader. The diagram
is interpreted like Figure 3. R_{\chi xy} is the common information in all the
variables. R_{xy} - R_{\chi xy} is the information common to x and y but
irrelevant to \chi. And so on. The reader should not assume that the circles
have any specified relative size, or that the areas shown are proportional to
the various H and R.


In the case we are presently interested in, we may assume that the error
in the fallible criterion (H_{x/\chi}) is independent of the error in predicting
the test response from the true criterion (H_{y/\chi}). Let us assume further
that y and x do not overlap in any way independent of \chi (i.e., that they
contain no common factor irrelevant to \chi). This condition specifies that
R_{xy} = R_{\chi xy}.


We  also  require  thet  Ry,.  = this  states  that  error  is  random.  Ihen 

Figure 6 takes the form shown in Figure 7, the circles of Figure 6 being
altered in shape to conform to the new conditions. We can read off various
identities (which would otherwise be derived from definitions involving
p_{x/\chi}, p_{y/\chi}, etc., and our assumptions).

The desired information is H_\chi. But the desired information in x is only
fallible; our exhaustiveness is not represented by R_{xy} / H_x.
Instead we may be interested in either of two questions:


(1) What fraction of the desired information (measured in message


^•***->  — . N + V>  ^ ^ 

'uX  S..*  A.  ••'-J. 


(37) 


If, contrary to assumption, y and x should contain some common factor not
in \chi, R_{xy} > R_{\chi xy}; but if data exist to measure R_{\chi y} directly,
multivariate treatment is not called for.


Figure 7. Relation of true criterion \chi, fallible criterion x, and test y.


(2) What amount of relevant information did we obtain relative to
the amount of desired information in the fallible criterion? Let this
exhaustiveness index be


,T 

/-.r 

“CO  /O' 


Formula (38) is closely comparable to the concept of "efficiency"
employed by Fisher to evaluate the amount of information elicited by a
statistical procedure. Fisher regards the procedure as a device
("communication channel," we might say) for gaining information about parameters
of a population distribution, employing sample data. He defines intrinsic
accuracy as equivalent to "the amount of information in a single observation
belonging to such a distribution" (16, p. 7^9). This clearly resembles the
"average information per item." "The efficiency of a statistic is the ratio of
the intrinsic accuracy of its random sampling distribution to the amount of
information in the data from which it has been derived" (10, p. 71'^). This




latter phrase describes a concept very close to the amount of relevant
information in a fallible criterion where all error is random, and the
efficiency ratio is comparable to the index of formula (38).

Formula (38) is a correction for criterion attenuation. It is
thoroughly comparable to the formulas customarily used in correlational work.
Similar formulas can be worked out to correct for test unreliability. It
has been noted by Garner and Hake (18) that H has some of the properties of
a contingency coefficient. Our statements in a preceding paragraph show
that we may regard J and K as comparable to eta, in that (unlike the
contingency coefficient) eta is directional. If we regard J_{x/y} and J_{y/x}
as suitable measures of relationship, then we have in (38) a way of
correcting them, by comparing them to their maximum values.

Since we question the use in test analysis of measures based on H, these
comments lead primarily to the suggestion that other measures of relationship
for categorical data can be corrected for attenuation in a similar manner.
Correction for attenuation is used to judge whether a given relationship is
high relative to what a fallible criterion permits. It is used to judge
whether an unreliable test is sufficiently saturated with valid variance to
justify lengthening the test to obtain greater reliability. While the


assumptions of randomness of error are often questionable and the
sampling error of corrected coefficients is often very large, nearly all
interpretation of correlations involves at least a crude application of
attenuation formulas, and such formulas for categorical data would indeed be
useful.


Application of the Formulas to Ordered Scales

While we have so far discussed the measures of uncertainty and
information in terms of a distribution of persons into a set of unordered
categories, Shannon demonstrates that they can be written in different form when
categories are ordered and there is a known frequency distribution on some
underlying measurement scale. In particular, the rectangular and normal
distributions are of interest.

The Rectangular Distribution


Suppose that x is a continuous variate having a rectangular distribution
in the population under study. Then in any score interval $\Delta x$, $p_x$
is a constant equal to $\Delta x$ divided by the range. We shall denote the
range by 2m, so that $p_x = \Delta x / 2m$.

    $H_x = -\sum p_x \log p_x = -\sum p_x \log \Delta x + \sum p_x \log 2m$    (39)

    $H_x = -\log \Delta x + \log 2m$    (40)

By dividing the scale into smaller and smaller units, we increase the
differentiation in the criterion scale, i.e., the information desired. As
$\Delta x$ is allowed to become infinitely small, $H_x$ becomes infinitely large.




Instead of inquiring how many standard items are required to provide
infinitely fine differentiation, we may set some particular degree of
resolution as desirable. That is, we may say that we wish to differentiate
persons with certainty into intervals of size u, and that differentiation
beyond that fineness does not interest us. This is equivalent to saying that
we will accept a residual uncertainty $H''$ where, since u is the a posteriori
range,

    $H'' = -\log \Delta x + \log u$    (41)

$H''$ is conceptually like $H'$, but is specified in a different way.
Therefore, the information desired is

    $H_x - H'' = \log 2m - \log u$    (42)

It will be noted that if the range 2m is expressed as a multiple of u
(i.e., if u is taken as the unit of measurement) the desired information
becomes log 2m. In general, whether categories are ordered or not, if
they are equally probable the uncertainty is the logarithm of the number of
categories.
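As a rough numerical illustration of (42), the following sketch computes the information desired for a rectangular distribution. The function name is hypothetical, and base-2 logarithms are assumed so that the result is in bits.

```python
import math

def rectangular_information_bits(range_width: float, unit: float) -> float:
    """Information desired under (42): log(2m) - log(u), in bits.

    `range_width` plays the role of 2m and `unit` the role of u.
    Base-2 logarithms are an assumption of this sketch.
    """
    if range_width <= 0 or unit <= 0:
        raise ValueError("range_width and unit must be positive")
    return math.log2(range_width) - math.log2(unit)

# A range of 32 score points resolved into cells one point wide is a set
# of 32 equally probable categories.
bits_fine = rectangular_information_bits(32.0, 1.0)
# Coarser cells, four points wide, leave only 8 categories.
bits_coarse = rectangular_information_bits(32.0, 4.0)
print(bits_fine, bits_coarse)
```

With u taken as the unit of measurement the second term vanishes, which is the case noted in the text.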

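The Normal Distribution

If the distribution of x is normal,

    $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/2\sigma^2}$    (43)

In any differential element whose midpoint is $x_i$ and whose width is
$\Delta x$, the proportion $p_i$ is $f(x_i)\,\Delta x$.

    $-\log p_i = \log \sqrt{2\pi}\,\sigma + \frac{x_i^2}{2\sigma^2} \log e - \log \Delta x$    (44)

    $H_x = -\sum_{-\infty}^{\infty} p_i \log p_i = \sum p_i \log \sqrt{2\pi}\,\sigma + \sum p_i \frac{x_i^2}{2\sigma^2} \log e - \sum p_i \log \Delta x$    (45)

    $H_x = \log \sqrt{2\pi}\,\sigma + \frac{\log e}{2\sigma^2} \sum p_i x_i^2 - \log \Delta x$    (46)

But $\sum p_i x_i^2$ is the average $x^2$, and as $\Delta x$ becomes very
small this approaches $\sigma^2$.

    $\lim H_x = \log \sqrt{2\pi}\,\sigma + \frac{1}{2} \log e - \log \Delta x$    (47)

    $\lim H_x = \log \sqrt{2\pi e}\,\sigma - \log \Delta x$    (48)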




This equation holds if the differential elements are sufficiently small that
f(x) as given by the normal curve is very close to f(x) given by the
discontinuous curve based on differential elements.


If we require differentiation between elements of width u, but do not
require differentiation into infinitesimal elements, this can be taken into
account. Where u is sufficiently small that we may regard the distribution
as rectangular within the interval, the desired information is

    $H_x - H'' = \log \sqrt{2\pi e}\,\sigma - \log u$    (49)

and if $\sigma$ is expressed in the units u,

    $H_x = \log \sqrt{2\pi e}\,\sigma = \log \sqrt{2\pi e} + \log \sigma$    (50)
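The approximation in (49) can be checked numerically. The sketch below compares the formula with the uncertainty computed directly from a normal distribution grouped into cells of width u. The function names, the base-2 logarithms, and the cutoff of ten standard deviations are assumptions made for illustration only.

```python
import math

def normal_information_bits(sigma: float, unit: float) -> float:
    """Formula (49) in bits: log sqrt(2*pi*e)*sigma - log u."""
    return math.log2(math.sqrt(2.0 * math.pi * math.e) * sigma) - math.log2(unit)

def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the unit normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binned_normal_entropy_bits(sigma: float, unit: float, span: float = 10.0) -> float:
    """Uncertainty of a normal variate grouped into cells of width `unit`.

    Cells are centered on the mean and extend `span` standard deviations
    in each direction (an arbitrary cutoff chosen for this sketch).
    """
    n_cells = int(span * sigma / unit)
    total = 0.0
    for i in range(-n_cells, n_cells + 1):
        lower = (i - 0.5) * unit / sigma
        upper = (i + 0.5) * unit / sigma
        p = normal_cdf(upper) - normal_cdf(lower)
        if p > 0.0:
            total -= p * math.log2(p)
    return total

approx = normal_information_bits(sigma=10.0, unit=1.0)
direct = binned_normal_entropy_bits(sigma=10.0, unit=1.0)
print(round(approx, 3), round(direct, 3))
```

When sigma is large relative to u, as here, the two values agree closely; the agreement deteriorates as the cells become wide relative to sigma.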

It is of interest to inquire how sensitive (50) is to distribution
shape. In the rectangular distribution, $\sigma = m/\sqrt{3}$. By (42), with
u as the unit, H = log 2m. But (50) would give

    $H = \log \sqrt{2\pi e}\,\frac{m}{\sqrt{3}} = \log 2m + \log \sqrt{\frac{\pi e}{6}}$

    $= \log 2m + \log 1.19$

    $= \log 2m + .25$

Hence, as distributions become more platykurtic than the normal, formula (50)
overestimates H by an amount not exceeding .25. This error is independent of
$\sigma$. (50) may be applied to non-normal distributions with greatest
confidence when $\sigma$ is large relative to u.
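The size of this overestimate is easy to verify. The sketch below, with hypothetical function names and base-2 logarithms assumed, applies (50) to a rectangular distribution and compares the result with log 2m.

```python
import math

def formula_50_bits(sigma: float) -> float:
    """Formula (50) in bits, with sigma expressed in units of u."""
    return math.log2(math.sqrt(2.0 * math.pi * math.e) * sigma)

def rectangular_excess_bits(m: float) -> float:
    """Amount by which (50) exceeds log 2m for a rectangular range 2m."""
    sigma = m / math.sqrt(3.0)  # standard deviation of a rectangular variate
    return formula_50_bits(sigma) - math.log2(2.0 * m)

# The excess does not depend on the range chosen.
excess_small = rectangular_excess_bits(5.0)
excess_large = rectangular_excess_bits(500.0)
print(round(excess_small, 3), round(excess_large, 3))
```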

R Under Assumptions of Normality

If we have the ordered quasi-normal variates x and y, we may set up the
usual bivariate distribution or scatter diagram (Fig. 8). The error (within
rows) has a variance $\sigma_{y.x}^2$ which may be different for different
values of x. Likewise, the equivocation (within columns) has a variance
$\sigma_{x.y}^2$ which need not be uniform over columns. The sigmas are the
usual standard errors of estimate of test from criterion, and criterion from
test, respectively.


[Figure 8. Bivariate correlation surface.]




Assuming normal distribution within arrays,

    $H_{x/y} = \log \sqrt{2\pi e} + \int p_y \log \sigma_{x.y}\, dy$    (51)

    $H_{y/x} = \log \sqrt{2\pi e} + \int p_x \log \sigma_{y.x}\, dx$    (52)

From (28),

    $R = -\int p_x \log p_x\, dx - \log \sqrt{2\pi e} - \int p_y \log \sigma_{x.y}\, dy$    (53)


If the distributions of x and y are normal, and the equivocation uniform over
arrays,

    $R = \log \sigma_x - \log \sigma_{x.y} = \log \frac{\sigma_x}{\sigma_{x.y}}$    (54)

or

    $R = \log \frac{\sigma_y}{\sigma_{y.x}}$    (55)


If we employ variance instead, using $V_E$ for error in transmitting y, and
$V_I$ for error in inference or estimation of x from y,

    $R = \frac{1}{2} \log \frac{V_x}{V_I} = \frac{1}{2} \log \frac{V_y}{V_E}$    (56)

By the usual definition of correlation (r),

    $r_{xy}^2 = 1 - \frac{V_E}{V_y} = 1 - \frac{V_I}{V_x}$    (57)

    $R = -\frac{1}{2} \log (1 - r^2) = -\log \sqrt{1 - r^2}$    (58)
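Formula (58) can be evaluated directly for any correlation. The following is a minimal sketch, assuming base-2 logarithms so that R is in bits; the function names are hypothetical.

```python
import math

def coefficient_of_alienation(r: float) -> float:
    """The radical in (58): sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)

def information_rate_bits(r: float) -> float:
    """Formula (58): R = -1/2 log(1 - r^2), in bits."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return -0.5 * math.log2(1.0 - r * r)

# A correlation near .87 halves the standard error of estimate,
# which corresponds to one bit.
r_one_bit = math.sqrt(0.75)
print(coefficient_of_alienation(r_one_bit), information_rate_bits(r_one_bit))
```

Because R depends on r only through 1 - r^2, equal increments in r are worth far more information near r = 1 than near r = 0.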


The radical in (58) is the familiar coefficient of alienation. Thus we
find that information measure is closely related to conventional test theory,
provided we have a normal bivariate distribution of test against criterion.

The rate of information which one test yields regarding an equivalent
test, and regarding the underlying true score, may be considered by referring
to Figure 9. We now let $\infty$ represent the true score, and x and y be
fallible measures of it. x and y are independent in the sense that errors
$e_{x/\infty}$ and $e_{y/\infty}$ are independent. Figure 9 describes the
reliability relations. To make x and y equivalent, assume $H_x = H_y$ and
[...]. It can be shown that

    $R_{xy} = -\log \sqrt{1 - r_{xy}^2}$    (59)

    $R_{x\infty} = R_{y\infty} = -\log \sqrt{1 - r_{xy}}$    (60)

The radical in (60) is of course the standard error of measurement.


In deriving R we have made the assumption that the distribution within
arrays is continuous and normal. This assumption is in conflict with the
assumption made in deriving (49), that the distribution within differential
elements is rectangular. The latter assumption is acceptable for the
distribution of x or y, but gives us a contradiction as $\sigma_{x.y}$ becomes
smaller and smaller. Formulas (54) to (60) apply exactly only if we regard
x and y as truly continuous, capable of being infinitely finely divided.
But this means that $H_x$ and $H_y$ are infinite (as u is infinitesimal).
Therefore J and K cannot be defined under a strict assumption of normality.
J and K are meaningful for the ordered case provided $H_x$ and $H_y$ are
determined from the discontinuous distribution of persons into cells u units
wide. Formula (49) and its derivatives may not be used for this.


The necessary conditions for (54) and subsequent formulas are

(1) that the shape of the distribution within arrays has the same
shape as the marginal distribution, and

(2) that the dispersion within arrays is uniform.

Normality is not required. These conditions do imply that u becomes
indefinitely small; i.e., continuity is required.

It will be observed that as r increases toward 1, R becomes indefinitely
large. This is possible, of course, under the assumption that the information
desired is infinite. The statement of R as $\log \sigma_x/\sigma_{x.y}$
clarifies the meaning of R. Before testing, we locate S within a band of
uncertainty described by $\sigma_x$. After testing we have the interval of
uncertainty described by $\sigma_{x.y}$. This describes uncertainty in just
the way we do when we append a standard error (e.g., 2.7 ± [...]) to a score.
The ratio shows




how we have divided our original uncertainty. Cutting it in half gives one
bit of information; cutting in eighths is equivalent to three bits. Three
"one-bit" items independent in the sense of condition (5) are able to reduce
the error of estimate to one-eighth $\sigma_x$.


Summary of Major Implications


In examining the measure of uncertainty, H, and the measure of rate
of change, R, we found that these measures do not have the characteristics
desirable for the evaluation of tests. The formulas apply precisely only
when patterns of inputs can be encoded simultaneously, a condition not
fulfilled in our testing problem. As approximations, the formulas express the
number of standard independent items required per person to classify him,
before and after testing. But this has meaning only if tests are used
sequentially, with different persons given different numbers of items. Hence,
there will be few occasions to treat data by the formulas of this section.


The sequential conception of testing has great potentiality. Since we
can make decisions about some persons after fewer items than are required for
others, we may be able to attain greater efficiency by sequential testing.
This is practicable only where the cost of administering a sequential plan is
little greater than the cost of administering uniform-length tests to all
persons. Evans has shown why sequential testing can be recommended for
performance testing.


The sequential concept can be applied also to diagnosis or evaluation of
a single person. Suppose a decision as to college admission requires
information that the subject has attained a certain level of proficiency in
each of eight areas (English, mathematical comprehension, etc.). The first
day could be used for a test of items having properly chosen difficulty in
every area. The tests would be scored promptly. Some students would earn such
high scores that a decision to treat them as passing in all areas could be
made with great confidence. Perhaps some could be rejected with confidence on
the basis of the short tests. It is more likely that for most individuals we
can find some areas where the person is definitely passable, and others where
it is uncertain how he should be classified. In those areas only, further
testing on the second day is wise. A more adequate decision can be reached by
thorough testing in these dubious areas than if the time over both days were
divided equally over all areas. In general, educational and psychiatric
diagnosis employ such sequential testing, and a statistical model based on
sequential analysis should be used to evaluate the effectiveness of the
procedures.


Attention was drawn in this section to the possibility of evaluating
separately the rate of information, exhaustiveness, and dependability. These
concepts have not come to attention in analyses of test validity where both
test and criterion were taken as continuous. The concepts are particularly
helpful in thinking of categorical, multidimensional data. Exhaustiveness
and dependability are not necessarily equal. It is to be expected that
personality test responses can be "postdicted" from practical criteria better
than we can predict from test to criteria. Such studies are recommended in
the early stages of test validation. In some practical situations, tests
with high exhaustiveness are preferable to tests of high dependability; in
other situations, dependability is the desideratum.


Equations for correcting information measures for unreliability have
been developed. Correction formulas can be developed for other measures of
relationship between categorical variates.


The conceptions summarized above are not bound to the Shannon formula
and we can expect them to be useful in any system of test analysis.




III. MEASURES OF UNCERTAINTY AND INFORMATION IN
TERMS OF CORRECT DECISIONS

A Measure of Average Uncertainty

In this section, we shall examine a set of information formulas based on
a somewhat different attack than Shannon's. These formulas are more suitable
for the tester than Shannon's. They are expressed in terms of average
confidence rather than average message space, and they assume that decisions
are made about one person at a time. The formulas are, however, much less
general than utility formulas.

Since one of the formulas to be developed in this section has been
proposed by other writers as a way of judging tests, we are able to extend the
meaning of their work. By placing their formula in the same perspective as
Shannon's and later relating it to utility theory, we clarify its meaning
and its limitations.


We may begin as in Section II, defining confidence.

    $\mathrm{Conf}(S \in a) = p_a$    (1)

When we make many decisions, we have some degree of confidence in each, i.e.,
some probability of being correct. What is the average goodness of our
decisions, or our average confidence? Each time we classify a person
correctly in some category, we make errors in classifying others. If we
add the confidence values for successive decisions, we obtain the expected
number of correct decisions. The mean confidence (the mean probability of a
correct decision) then is a measure of the goodness of decisions.

Call $C_x$ the proportion of correct decisions regarding x to be expected.
To assess a priori confidence, assume that N persons are assigned to
criterion classifications by chance, $Np_x$ persons being assigned to
category x. $p_x$ is the expected frequency of category x. The probability of
a "hit" when a person is assigned to category x is $p_x$, and

    $\lim C_x = \frac{1}{N} \sum_x N p_x^2 = \sum_x p_x^2$    (2)

Let U be the proportion of misclassifications in a very large sample of
chance assignments.

    $U_x = 1 - \sum_x p_x^2 = 1 - C_x$    (3)
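Formulas (2) and (3) amount to a sum of squared proportions. A minimal sketch follows; the function names and the example proportions are hypothetical.

```python
def chance_confidence(proportions):
    """C_x of (2): expected proportion of hits under chance assignment."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("category proportions must sum to 1")
    return sum(p * p for p in proportions)

def chance_uncertainty(proportions):
    """U_x of (3): expected proportion of misclassifications."""
    return 1.0 - chance_confidence(proportions)

# Three criterion categories with unequal frequencies.
p_x = [0.5, 0.3, 0.2]
print(chance_confidence(p_x), chance_uncertainty(p_x))

# With k equally frequent categories the confidence falls to 1/k.
equal = [0.25] * 4
print(chance_confidence(equal))
```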


If compound categories, such as red-even, are built up, we can express
the confidence of correct assignment simply, provided the category sets are
independent. If the compound category xX includes any person who is both x
and X, the two basic category systems are independent when
$p_{xX} = p_x p_X$. Then

    $C_{xX} = \sum_x \sum_X p_x^2\, p_X^2 = C_x C_X$    (4)

That is to say, the probability of hits for assignments to the complex
categories is the product of the average probabilities for the separate
category systems.


Previously Published Formulas for Measuring Discriminating Power


Other writers have regarded a test as a tool for discriminating between
persons  (15;  29;  3,  p.  12^1).  A discrimination  is  a statement  such  as  "A  is 


unlike B." If there are N persons, we have $N(N - 1)/2$ possible
discriminations to be made. But if there are k possible scores or categories,
and k < N, the number of discriminations cannot be greater than

    $\frac{N}{2}\left(N - \frac{N}{k}\right) = \frac{N^2}{2}\left(1 - \frac{1}{k}\right)$

The actual number of discriminations made by the test is

    $\frac{1}{2} \sum_y N_y (N - N_y)$  or  $\frac{N^2}{2}\left(1 - \sum_y p_y^2\right)$

Ferguson suggested evaluating the test by its discriminating power
relative to the maximum for the given k, and offers the index

    $\delta = \frac{k}{k - 1}\left(1 - \sum_y p_y^2\right)$    (5)

Thurlow takes the ratio relative to the limit set by N,

    $\frac{N}{N - 1}\left(1 - \sum_y p_y^2\right)$    (6)

If we let k or N increase indefinitely, either index reduces to $U_y$, as
derived by substituting the $p_y$ in (3).

Thus the discrimination indices are essentially measures of $U_y$, the
uncertainty or variation in responses, just as $U_x$ as expressed in (3) is a
measure of the variation in the criterion.
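The relation between the count of discriminations and the two indices can be checked on a small set of scores. The sketch below counts unlike pairs by brute force and compares the count with the expressions above; the function names and the score list are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_discriminations(scores):
    """Number of pairs of persons who receive different scores."""
    return sum(1 for a, b in combinations(scores, 2) if a != b)

def response_uncertainty(scores):
    """U_y: one minus the sum of squared score proportions, as in (3)."""
    n = len(scores)
    return 1.0 - sum((c / n) ** 2 for c in Counter(scores).values())

def ferguson_index(scores, k):
    """Index (5): discriminations relative to the maximum for k categories."""
    return k / (k - 1) * response_uncertainty(scores)

def thurlow_ratio(scores):
    """Index (6): discriminations relative to the limit set by N."""
    n = len(scores)
    return n / (n - 1) * response_uncertainty(scores)

scores = [0, 0, 1, 1, 1, 2, 2, 2, 2, 3]   # N = 10 persons, k = 4 scores
n = len(scores)
made = count_discriminations(scores)
print(made, ferguson_index(scores, 4), thurlow_ratio(scores))
```

Both indices are the same count divided by different ceilings, which is why they converge on U_y as k or N grows.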

Geometric Interpretation as a Dispersion Measure

Formula (3) has an interesting and possibly useful geometric
significance. We may regard the x categories as defining a set of orthogonal
vectors of unit length. Category a may be represented by the vector
1, 0, 0, 0, ...; b is represented by the vector 0, 1, 0, 0, ...; etc. These
vectors define a coordinate system, and each person has a true location in
this system. A person belonging in category a is correctly located when he is
assigned at point (1, 0, 0, 0, ...). Figure 10 shows three coordinates, and
also the point ($p_a$, $p_b$, $p_c$). The point located by
($p_a$, $p_b$, $p_c$) is the centroid of all points, when persons are properly
located. Every possible distribution of a sample of persons is defined by
some point in the plane x + y + z = 1.00. In Figure 11, therefore, we view
this plane directly.

When a priori we know $p_a$, $p_b$, $p_c$, but know nothing about the
location of individuals, we might consider all persons as located at the
point M ($p_a$, $p_b$, $p_c$). A person whose true category is a is misplaced
from his proper position by the distance MA.


[Figure 11. Plane viewed directly.]


The heavy circles in Figure 11 show how persons would be distributed if each
was assigned to his true category. There are $Np_a$ persons displaced by the
distance MA. Averaging the squared displacements over persons in all
categories, we find

    $p_a \overline{MA}^2 + p_b \overline{MB}^2 + p_c \overline{MC}^2 = 1 - \sum p_x^2$    (8)


That is to say, if we regard all persons as located at the mean of the
distribution to which they are known to belong, then the average uncertainty
is the mean squared error. This is a measure of dispersion, the multivariate
analog to variance. Ferguson also has noted this analogy to variance.
Essentially, in setting up this geometric model, we have assumed that any
misclassification is as serious as any other; i.e., that assigning an a to
category b is as serious as assigning a b to category a or c, etc. This
amounts to assuming a particular evaluation matrix (see Section IV).
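The geometric statement can be confirmed by direct computation: placing every person at the centroid and averaging squared distances to the true vertices reproduces the uncertainty of (3). The following sketch uses hypothetical names and proportions.

```python
def mean_squared_displacement(proportions):
    """Average squared distance from the centroid M to each person's vertex.

    Category a has the vertex (1, 0, 0, ...), category b the vertex
    (0, 1, 0, ...), and so on; M has the category proportions as coordinates.
    """
    k = len(proportions)
    total = 0.0
    for a in range(k):
        squared_distance = sum(
            ((1.0 if j == a else 0.0) - proportions[j]) ** 2 for j in range(k)
        )
        total += proportions[a] * squared_distance
    return total

p = [0.5, 0.3, 0.2]
dispersion = mean_squared_displacement(p)
uncertainty = 1.0 - sum(q * q for q in p)
print(dispersion, uncertainty)
```

Every vertex lies at the same distance from every other vertex, which is the geometric form of the assumption that all misclassifications are equally serious.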

Residual Uncertainty after Testing

Previous authors are perhaps unwise in referring to $U_y$ as a measure
of "the discriminating power" of a test, just as it is unwise to refer to
$H_y$ as "the information yielded." $U_y$ and $H_y$ are comparable. $H_y$ has
been interpreted as a statement of the variation in the output messages.
$U_y$ indicates the amount of reported discrimination, but these
discriminating statements will not all be valid, as Thurlow has pointed out.

A test would ordinarily permit us to classify people into x categories
with greater confidence than we had a priori. Each person in a given
y category could be assigned to an x category, basing inferences on the known
$p_{x/y}$. Assuming a chance assignment as before, if $N_y$ people give
response y, $N_y p_{x/y}$ of them would be assigned to category x. For any y,
say $\alpha$, the average confidence is denoted by $C_{x/\alpha}$. From (2),

    $C_{x/\alpha} = \sum_x p_{x/\alpha}^2$    (9)




And over all y's, the average a posteriori confidence is

    $C_{x/y} = \sum_y p_y \sum_x p_{x/y}^2 = \sum_x \sum_y p_y\, p_{x/y}^2$    (10)

Here we shall use $C_1$ to indicate average confidence in classifying on the
basis of item 1. We can rewrite (10) thus:

    $C_1 = \sum_x \sum_y p_y\, p_{x/y}^2 = \sum_x \sum_y \frac{p_{xy}^2}{p_y} = \sum_x \sum_y \frac{p_x^2\, p_{y/x}^2}{p_y}$    (11)

We obtain residual uncertainty $U_1$ by subtracting $C_1$ from unity.

We may discuss this in terms of the geometric model. For each y
category, we have a set of $p_{x/y}$ which define the centroid of the
criterion distribution for persons giving response y. Figure 12(a) shows the
original M of Figure 11, together with the centroids $M_\alpha$, $M_\beta$,
$M_\gamma$. Figure 12(b) shows the distribution of persons giving response
$\alpha$. The dispersion within any group is smaller than the a priori
dispersion. $U_1$ is the weighted average of these within-group (residual)
dispersions.
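The a posteriori confidence of (10) and (11) can be computed from a table of joint proportions. The sketch below evaluates two of the equivalent forms and confirms that they agree. The function names and the joint table are hypothetical; rows stand for criterion categories x and columns for responses y.

```python
def response_marginals(joint):
    """p_y: column sums of the joint table."""
    return [sum(row[j] for row in joint) for j in range(len(joint[0]))]

def posterior_confidence(joint):
    """C_1 as the sum over x and y of p_xy^2 / p_y, the second form in (11)."""
    p_y = response_marginals(joint)
    return sum(
        row[j] ** 2 / p_y[j]
        for row in joint
        for j in range(len(p_y))
        if p_y[j] > 0.0
    )

def posterior_confidence_conditional(joint):
    """C_1 as the sum over y of p_y times the sum of p_{x/y}^2, as in (10)."""
    p_y = response_marginals(joint)
    total = 0.0
    for j, weight in enumerate(p_y):
        if weight > 0.0:
            total += weight * sum((row[j] / weight) ** 2 for row in joint)
    return total

joint = [[0.4, 0.1],
         [0.1, 0.4]]
c1 = posterior_confidence(joint)
u1 = 1.0 - c1          # residual uncertainty
print(c1, u1)
```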


Gain in Average Confidence (Information)

We employ $\Delta C$ to represent the gain in average confidence (i.e., in
expectation of correct decisions) as a result of testing.

    $\Delta C = C_1 - C_0 = U_0 - U_1$    (12)

    $\Delta C = \sum_y \sum_x p_y\, p_{x/y}^2 - \sum_x p_x^2 = \sum_x \sum_y p_y \left(p_{x/y}^2 - p_x^2\right)$    (13)


Hence $\Delta C$ is the increase in confidence for any category, when y is
known, weighted and summed over all categories and all values of y. We can
rewrite in a form allowing us to state the distance between any two states:

    $\Delta C = \sum_x \sum_y p_y \left(p_{x/y} - p_x\right)^2$    (14)

    $\Delta C = \sum_x \sum_y \frac{\left(p_{xy} - p_x p_y\right)^2}{p_y}$    (15)

In the model studied in this section, $\Delta C$ represents the ability of the
test to reduce uncertainty or to convey information. It is analogous to R of
Section II.


Uncertainty may be decomposed as in analysis of variance to show
various sources of discrimination. For N large, the criterion permits
$\frac{N^2}{2}(U_x)$ discriminations. When a test is given to persons in
category a, error introduces differences in response. There are
$\frac{N_a^2}{2}(U_{y/a})$ added discriminations, where

    $U_{y/a} = 1 - \sum_y p_{y/a}^2$    (16)

Summing over all x categories, the total error discriminations are

    $\sum_x \frac{N_x^2}{2}\left(1 - \sum_y p_{y/x}^2\right) = \frac{N^2}{2} \sum_x p_x^2 \left(1 - \sum_y p_{y/x}^2\right)$    (17)

    $\frac{N^2}{2}\, U_{y/x} = \frac{N^2}{2}\left(\sum_x p_x^2 - \sum_x \sum_y p_{xy}^2\right)$    (18)

Then the total number of discriminations is

    $\frac{N^2}{2}\, U_x + \frac{N^2}{2}\, U_{y/x} = \frac{N^2}{2}\left(1 - \sum_x \sum_y p_{xy}^2\right)$    (19)

The right hand member of (19) is exactly the number of discriminations (which
we might designate $\frac{N^2}{2} U_{xy}$) that we would have if we could
record the complete matrix of true responses and errors, thus dividing people
among yx categories. Dividing through by $\frac{N^2}{2}$, we obtain

    $U_{xy} = 1 - \sum_x \sum_y p_{xy}^2 = U_x + U_{y/x}$    (20)

[...] Furthermore, we could write

    $U_{xy} = U_y + U_{x/y}$    (21)

The total discriminations in the system may be divided into true and error
discriminations, or into obtained discriminations and equivocation (i.e.,
desired discriminations not obtained). The relations implied by Figures 4,
5, and 6 apply to U as well as to H.

The measure of information gained, given in (13), is close to Thurlow's
proposal for judging the number (or proportion) of valid discriminations made
by a test when categories are unordered (29, p. 304-5). $\Delta C$ is easier
to compute than Thurlow's index, however, and $\Delta C$ has several
properties not found in Thurlow's measure. Thurlow's index can be
applied only to the case where there is an obvious one-to-one correspondence
between the x and y categories. Moreover, some questions may be raised
regarding Thurlow's method of counting correct discriminations. Ferguson and
Bechtoldt make no suggestion for considering the validity of discriminations.

Gain in certainty can be translated into the terms of analysis of
variance. The mean square dispersion between groups is
$p_\alpha (\overline{MM_\alpha})^2 + p_\beta (\overline{MM_\beta})^2 + p_\gamma (\overline{MM_\gamma})^2$.
This is equal to $C_1 - C_0$ or $U_0 - U_1$. Thus $U_0$, the a priori
uncertainty, equals the dispersion between response groups plus the total
dispersion within response groups (residual uncertainty).
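This decomposition can be checked numerically. The sketch below computes the a priori uncertainty, the between-group dispersion (squared distances from M to each response-group centroid, weighted by group size), and the within-group dispersion, and confirms that the first is the sum of the other two. The names and the joint tables are hypothetical; rows are criterion categories and columns are responses.

```python
def decompose_uncertainty(joint):
    """Return (U_0, between-group dispersion, within-group dispersion)."""
    n_x, n_y = len(joint), len(joint[0])
    p_x = [sum(row) for row in joint]
    p_y = [sum(joint[i][j] for i in range(n_x)) for j in range(n_y)]
    u0 = 1.0 - sum(p * p for p in p_x)
    between = 0.0
    within = 0.0
    for j in range(n_y):
        if p_y[j] == 0.0:
            continue
        # Centroid of the criterion distribution for response group j.
        centroid = [joint[i][j] / p_y[j] for i in range(n_x)]
        between += p_y[j] * sum((centroid[i] - p_x[i]) ** 2 for i in range(n_x))
        within += p_y[j] * (1.0 - sum(c * c for c in centroid))
    return u0, between, within

symmetric = [[0.4, 0.1],
             [0.1, 0.4]]
u0, between, within = decompose_uncertainty(symmetric)
print(u0, between, within)
```

The between-group term is the gain in confidence of (12) and the within-group term is the residual uncertainty, so a test adds information only to the extent that it pulls the group centroids away from M.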




Information Yielded by a Series of Independent Items

When two independent items are combined, what is their total information
yield? As in Section II, we define independence by the conditions:

    $p_{y_1 y_2} = p_{y_1}\, p_{y_2}$

    $p_{y_1 y_2/x} = p_{y_1/x}\, p_{y_2/x}$    (22)


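After one item, a certain number of persons are classified in any category x.
We may denote by $\mathrm{Conf}_{a1}$ the confidence in classifying a person
as a on the basis of item 1. We shall let $C_{1a}$ be the proportion of all
persons who are correctly classified in category a. Then
$C_{1a} = p_a (\mathrm{Conf}_{a1})$. Similarly for all x categories, so that

    $C_1 = \sum_x C_{1x}$    (23)

Equation (11) demonstrated that $C_1$, over all x's, is
$\sum_x \sum_y p_{xy}^2 / p_y$. For category a,

    $C_{1a} = \sum_{y_1} \frac{p_{a y_1}^2}{p_{y_1}} = p_a^2 \sum_{y_1} \frac{p_{y_1/a}^2}{p_{y_1}}$    (24)

If $C_{12a}$ is the similar proportion based on independent items 1 and 2
together,

    $C_{12a} = \sum_{y_1} \sum_{y_2} \frac{p_{a y_1 y_2}^2}{p_{y_1 y_2}} = p_a^2 \sum_{y_1} \sum_{y_2} \frac{p_{y_1/a}^2\, p_{y_2/a}^2}{p_{y_1}\, p_{y_2}}$    (25)

Using (24),

    $C_{12a} = \frac{C_{1a}\, C_{2a}}{p_a^2}$    (26)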
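The product rule (26) can be verified on a small constructed case. The sketch below uses four equally frequent criterion categories and two pass-fail items whose pass probabilities are chosen so that both conditions in (22) hold; all of the numbers and names are hypothetical.

```python
from itertools import product

# Four equally frequent criterion categories and two pass-fail items.
p_x = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
pass_1 = {"a": 0.9, "b": 0.9, "c": 0.1, "d": 0.1}   # P(pass item 1 | x)
pass_2 = {"a": 0.9, "b": 0.1, "c": 0.9, "d": 0.1}   # P(pass item 2 | x)

def p_response(pass_probability, y):
    """Probability of response y (1 = pass, 0 = fail) given the pass rate."""
    return pass_probability if y == 1 else 1.0 - pass_probability

def correct_proportion(category, items):
    """Sum over response patterns of p_{a,pattern}^2 / p_pattern, as in (25).

    Responses to different items are multiplied within each criterion
    category, i.e. the second condition of (22) is built in.
    """
    total = 0.0
    for pattern in product((0, 1), repeat=len(items)):
        joint = {}
        for x, weight in p_x.items():
            probability = weight
            for item, y in zip(items, pattern):
                probability *= p_response(item[x], y)
            joint[x] = probability
        p_pattern = sum(joint.values())
        if p_pattern > 0.0:
            total += joint[category] ** 2 / p_pattern
    return total

for category in p_x:
    c1 = correct_proportion(category, [pass_1])
    c2 = correct_proportion(category, [pass_2])
    c12 = correct_proportion(category, [pass_1, pass_2])
    print(category, round(c12, 4), round(c1 * c2 / p_x[category] ** 2, 4))
```

The pass rates were chosen so that the two items are uncorrelated over the whole population as well as within categories. With only two criterion categories this cannot be arranged unless one item is uninformative, which illustrates how restrictive the conditions in (22) are.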


We may generalize (26) to t items thus:

    $C_{(12 \ldots t)a} = p_a^2 \left(\frac{C_{1a}}{p_a^2}\right) \left(\frac{C_{2a}}{p_a^2}\right) \cdots \left(\frac{C_{ta}}{p_a^2}\right)$    (27)

    $\frac{C_{(12 \ldots t)a}}{p_a^2} = \prod_{i=1}^{t} \frac{C_{ia}}{p_a^2}$    (28)

If each item has the same ability to reduce uncertainty,

    $C_{(12 \ldots t)a} = p_a^2 \left(\frac{C_{1a}}{p_a^2}\right)^t$    (29)


[Figure 13. Expectation of correct classification within any category x,
shown a priori and after one, two, and three items.]


As in Section II, we may inquire how many uniform independent items are
required to raise to 1.00 the a posteriori confidence in classifying
a person as a. Following (1), employ a conditional notation to write

    $\mathrm{Conf}_{a/y} = \frac{C_{1a}}{p_a}$    (30)

Dividing through (29) by $p_a$, and setting confidence equal to 1, we have

    $1 = p_a \left(\frac{\mathrm{Conf}_{a/y}}{p_a}\right)^{n_a}$    (31)

    $n_a = \frac{-\log p_a}{\log \mathrm{Conf}_{a/y} - \log p_a}$    (32)

This formula is similar to (32) of Section II, and demonstrates that the
model of the present section yields the same measure of number of independent
items required in sequential testing as does Shannon's, so long as we
consider persons in any one criterion category. The weighted average of $n_a$
from (32) here is not the same as n from (32, Sec. II).
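As an illustration of (32), the sketch below computes the number of uniform independent items needed to reach certainty for one criterion category. The function name and the example values are hypothetical, and the form of the formula follows the reconstruction given above.

```python
import math

def items_required(p_a: float, conf_after_one_item: float) -> float:
    """n_a of (32): items needed to raise confidence in category a to 1.

    `p_a` is the a priori confidence and `conf_after_one_item` the
    confidence in classifying a person as a after a single item.
    """
    if not 0.0 < p_a < conf_after_one_item <= 1.0:
        raise ValueError("need 0 < p_a < conf_after_one_item <= 1")
    return -math.log(p_a) / (math.log(conf_after_one_item) - math.log(p_a))

# A category holding one person in eight, with each item doubling the
# confidence: 1/8 -> 1/4 -> 1/2 -> 1.
n_a = items_required(p_a=0.125, conf_after_one_item=0.25)
print(n_a)
```

The ratio of logarithms does not depend on the base chosen, so natural logarithms are used here for convenience.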


While (29) can be summed over x categories to give the total
a posteriori confidence after t items, this summation is not simple in form
unless $C_{1x}/p_x^2$ is the same for all categories. Figure 13 shows the way
in which confidence increases with added items. This may be compared with
Figure 2. Note that here, instead of each item adding the same or less than
preceding items, we find that each item adds more to average confidence (C)
than did its predecessor, until certainty is reached. No practical
implications should be drawn from this rather startling result until formulas
are developed to take into account correlations among successive items. Our
independence conditions require that items have a negative correlation, with
x held constant, and this is not usual in practice.


Exhaustiveness and Dependability

In the system here discussed, we can write measures of exhaustiveness
and dependability.




    E (Exhaustiveness) $= \frac{\text{Information obtained}}{\text{Information desired}} = \frac{C_1 - C_0}{1 - C_0}$

If we wish to know how completely the test responses are specified by the
criterion, we have the comparable formula,

    D (Dependability) $= \frac{\sum_x \sum_y p_x\, p_{y/x}^2 - \sum_y p_y^2}{1 - \sum_y p_y^2}$

All the argument of pages 31-33, regarding the significance of J and K
for validity studies, may be carried forward to E and D.
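A small example shows that E and D need not be equal. In the hypothetical joint table below, each criterion category always gives the same response, so the responses are fully specified by the criterion, yet one response covers two criterion categories and so cannot separate them. The function name and the table are assumptions for illustration.

```python
def exhaustiveness_and_dependability(joint):
    """Return (E, D) for a joint table with rows x and columns y."""
    n_x, n_y = len(joint), len(joint[0])
    p_x = [sum(row) for row in joint]
    p_y = [sum(joint[i][j] for i in range(n_x)) for j in range(n_y)]
    c0_x = sum(p * p for p in p_x)   # a priori confidence regarding x
    c0_y = sum(p * p for p in p_y)   # a priori confidence regarding y
    # Confidence in classifying x when the response y is known.
    c1_x = sum(joint[i][j] ** 2 / p_y[j]
               for i in range(n_x) for j in range(n_y) if p_y[j] > 0.0)
    # Confidence in stating the response y when the criterion x is known.
    c1_y = sum(joint[i][j] ** 2 / p_x[i]
               for i in range(n_x) for j in range(n_y) if p_x[i] > 0.0)
    return (c1_x - c0_x) / (1.0 - c0_x), (c1_y - c0_y) / (1.0 - c0_y)

joint = [[0.3, 0.0],
         [0.3, 0.0],
         [0.0, 0.4]]
e, d = exhaustiveness_and_dependability(joint)
print(round(e, 3), round(d, 3))
```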


Significance of Confidence Formulas

In summarizing this section, we should note that all the implications
of Sections I and II are consistent with the equations developed here. Thus
nothing of the practical meaning of the Shannon formulation, insofar as it
applies to testing, is lost by adopting a measure of average confidence.


The actual formulas presented here may be of some direct use, even
though Section IV will emphasize the reasons for seeking a treatment of test
data which evaluates confidence on a basis other than chance assignment of
men to categories. Such a method, while logically superior to the one
developed in Section III, will often be difficult to treat mathematically.
In working out functional relationships to determine how much confidence is
increased by some change in testing procedure, it may be desirable to employ
[...]


Our results carry implications for the "discrimination" formulas others
have offered. The formulas of Ferguson, for instance, should be regarded as
a measure only of claimed or attempted discriminations, not as an index of
the useful information in the test. In any event, those formulas are
appropriate only if all misclassifications are equally serious, as is
probably not true for ordered categories or scores. While our formulas may
provide the best way to take validity into account in discrimination
formulas, our doubts regarding the average certainty formulas suggest that
the discrimination formulas [...]


IV. AN INTRODUCTORY STATEMENT OF UTILITY THEORY, GIVING ITS
IMPLICATIONS REGARDING THE INFORMATION FORMULAS


It is our purpose here to indicate the nature of utility theory. We are
not yet prepared, however, to give a general statement of the utility theory
with a well-tested notation, systematic formulae, and the like. Section V
applies utility theory to a specific limited problem, and this section will
prepare the reader for that example.


The concepts introduced in utility theory [...]
value of the formulas based on H and C, presented in Sections II and III.

Utility Theory

A utility theory is an attempt to state the benefit derived from an
operation or procedure, by comparing the goodness or utility of decisions
based on the procedure with the utility of decisions made without the
procedure. Utilities are a key concept in economic theory, and in theory of
games and strategy (50, 51).


Basic  Data  Required 

To judge the utility of a test (or of a procedure for making decisions
from the test), we require three matrices or tables of data.


Transition matrix. The first matrix is a transition matrix relating the
criterion scores x to the obtained scores y. The entries in this matrix may
be written in the form of either joint or conditional probabilities. We
shall here employ conditional probabilities; Table 1 (p. 16) showed such a
matrix. Data for a transition matrix come from an empirical trial where
test and criterion data are obtained for each person. The margin of the
transition matrix gives the distribution of x. One problem is that this
matrix is based on sample data, whereas the way we have computed utilities
below assumes population data.


Table 3. Specimen Evaluation Matrix

                          Assigned Classification (x')
                            a'          b'          c'       ...
Criterion          a     e_{aa'}     e_{ab'}     e_{ac'}     ...
Classification     b     e_{ba'}     e_{bb'}     e_{bc'}     ...
(x)                c     e_{ca'}     e_{cb'}     e_{cc'}     ...
                  ...


Evaluation matrix. The second matrix is an evaluation matrix, telling
what value we assign to each successful or unsuccessful classification. The
test information (y_s) is used to assign person S a classification x'. The
evaluation matrix consists of values e_{xx'}. For example, e_{ab'} states the
value (gain or loss) when a person whose criterion classification is a is
assigned to category b. A specimen evaluation matrix is given in Table 3.


Evaluations must be determined by an accounting process of some sort.
Brogden (5) and Brogden and Taylor (6) have discussed the desirability of
introducing such utilities into test analysis. When it is impractical to fix
evaluations by accounting studies, they must be specified by the judgment of
some person concerned in the decision. Some setting of utilities or risks
is required whenever decisions are made. Often the value judgments are made
implicitly. Each conventional method of assessing tests embodies some
particular weighting of risks or errors; the user of the formula may not
realize what is thus assumed. Utility theory only makes these value judgments
explicit so that they can be reviewed openly and altered when necessary.


Interpretation matrix. The third matrix is the interpretation matrix.
The interpretation matrix specifies the "strategy" or decision function to be
used in assigning persons to the x' categories. The rule might be that all
persons giving response α will be placed in category a; or that some
percentage of them selected at random will be called a's, and the rest b's;
etc. Each entry (see Table 4) takes the form of a probability w_{x'/y}.
Different interpretation rules give different benefits. For any given
transition matrix and evaluation matrix, there is a best strategy. The
determination of cutting scores in Section V illustrates how we choose a best
interpretation matrix or decision procedure for a given body of data.


Calculation  of  Utility 


If the interpretation matrix shows how to convert y to x', and the
transition matrix shows how x depends on y, then we may construct a new
matrix in which each element is p_{xx'}, the joint probability of x and x'.

    p_{xx'} = \sum_y p_x p_{y/x} w_{x'/y} = \sum_y p_{xy} w_{x'/y}        (1)

Table 4. Specimen Interpretation Matrix

                          Assigned Classification (x')
                            a'            b'          ...
Response           α     w_{a'/α}      w_{b'/α}       ...
Category (y)       β     w_{a'/β}      w_{b'/β}       ...
                  ...

The evaluation matrix assigns a value to each xx', and the value of all
classifications in the xx' cell is

    V_{xx'} = N p_{xx'} e_{xx'} = N e_{xx'} \sum_y p_{xy} w_{x'/y}        (2)


The sum of all V_{xx'} over all x and x' gives the total utility of decisions.
This sum may be divided by N to obtain an average utility per person, V. We
are using V rather than U as our symbol here because U was employed previously
in this report, for Uncertainty.

    V = \sum_x \sum_{x'} \sum_y e_{xx'} p_{xy} w_{x'/y}                   (3)
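The chain from the three matrices to an average utility can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 2 x 2 numbers are invented for the example and are not taken from the report.

```python
# Minimal sketch of Equations (1) and (3). All numbers are hypothetical.

def joint_assignment_matrix(p_x, p_y_given_x, w_xp_given_y):
    """Equation (1): p[x][x'] = sum over y of p_x * p_{y/x} * w_{x'/y}."""
    n_y = len(w_xp_given_y)
    n_xp = len(w_xp_given_y[0])
    return [
        [
            sum(p_x[x] * p_y_given_x[x][y] * w_xp_given_y[y][xp]
                for y in range(n_y))
            for xp in range(n_xp)
        ]
        for x in range(len(p_x))
    ]


def average_utility(p_xxp, e):
    """Equation (3): V = sum over x and x' of e[x][x'] * p[x][x']."""
    return sum(
        e[x][xp] * p_xxp[x][xp]
        for x in range(len(e))
        for xp in range(len(e[0]))
    )


p_x = [0.3, 0.7]                          # criterion distribution (assumed)
p_y_given_x = [[0.8, 0.2], [0.1, 0.9]]    # transition matrix, rows x, columns y
w = [[1.0, 0.0], [0.0, 1.0]]              # interpretation matrix, rows y, columns x'
e = [[2.0, -3.0], [-1.0, 1.0]]            # evaluation matrix, rows x, columns x'

p_xxp = joint_assignment_matrix(p_x, p_y_given_x, w)
V = average_utility(p_xxp, e)             # average utility per person
```

Changing only `w` while holding the other two matrices fixed shows how different interpretation rules yield different utilities from the same test data.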


The utilities may be interpreted directly. They may also be converted to
the gain over some other strategy, or over a chance decision. This gain may
be divided by the maximum possible gain to obtain a relative utility, RV.

    RV = \frac{V - V_0}{V_{max} - V_0}                                    (4)

Here, V_0 is the utility of the strategy taken as the basis of comparison,
and V_{max} is the value obtained when every person in a particular category
x is assigned to that x' for which e_{xx'} is greatest, i.e., when we make the
most profitable assignment possible.
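A small numerical sketch may help fix the idea of relative utility. It assumes that RV takes the form (V - V_0)/(V_max - V_0), which is our reading of "gain divided by maximum possible gain", and it assumes that the comparison strategy is the best single a priori assignment. The numbers and function names are hypothetical.

```python
# Sketch of relative utility under the assumptions stated above.

def max_utility(p_x, e):
    """V_max: every category x is sent to the x' with the largest e[x][x']."""
    return sum(p * max(row) for p, row in zip(p_x, e))


def best_a_priori_utility(p_x, e):
    """V_0 (assumed form): everyone is given the single best classification x'."""
    n_xp = len(e[0])
    return max(
        sum(p * row[xp] for p, row in zip(p_x, e))
        for xp in range(n_xp)
    )


def relative_utility(v, v_0, v_max):
    """RV = (V - V_0) / (V_max - V_0)."""
    return (v - v_0) / (v_max - v_0)


p_x = [0.3, 0.7]                    # hypothetical criterion distribution
e = [[2.0, -3.0], [-1.0, 1.0]]      # hypothetical evaluation matrix
V = 0.86                            # hypothetical utility achieved with a test

v_max = max_utility(p_x, e)
v_0 = best_a_priori_utility(p_x, e)
rv = relative_utility(V, v_0, v_max)
```

With these figures the test recovers roughly two thirds of the gain that a perfect classification would give over the best blind assignment.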

Table 5 draws attention to basic similarities between these types of
measures and measures of preceding sections.

Table 5. Comparable Concepts in Three Systems

                                 Utility             Measure of       Measure of
                                 measure             Section II       Section III

Quality of decisions at          V                   H_1              U_1
end of testing

Gain over a priori               ΔV = V - V_0        R = H_0 - H_1    U_0 - U_1
situation

Gain relative to                 RV (based on        [...]            [...]
possible gain                    V_{max})


It will be noted that we are able to take into account, in the utility
measure, the implications of preceding sections. For instance, the suggestion
that the test be evaluated in terms of gains over the best a priori decision
(p. 13) is embodied in the formula ΔV. The fact that exhaustiveness is not
synonymous with rate of information is represented in the distinction between
ΔV and RV. Dependability will require more indirect treatment in utility
theory than in the information theories.


Review of the Average Confidence Formulas


Every method of evaluating a test in some way specifies an interpretation
plan. For instance, correlational analysis evaluates the goodness of
decisions when persons are assigned on the basis of a regression line. Also,
every method embodies some set of evaluations. (In correlation, for example,
the mean square error is determined; thus, an error of two points is counted
as four times an error of one point.) Other formulas may be viewed as special
cases within utility theory.

The formulas of Section III, based on counting hits, treat all hits
as equal in value, and all errors as equal in value. Thus the evaluation
matrix is

                 x'
            a'   b'   c'  ...
       a    1    0    0
  x    b    0    1    0
       c    0    0    1
      ...

This is clearly not always the most suitable evaluation matrix for a given
problem.


The interpretation matrix assumed in Section III for determining the
expected number of hits sets each w_{x'/y} equal to the corresponding p_{x/y}.
This distributes persons having a given response according to the known
expectancy of each x category in the group. This is not ordinarily the
best strategy available. Our number of hits is always greatest if we assign
all persons giving response α to that x' for which the corresponding p_{x/α}
is greatest. Some other strategy may be superior to this when evaluations
are introduced. We cannot expect, in general, that the chance assignment
by w_{x'/y} = p_{x/y} [...]
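The difference between the two interpretation rules just described can be checked numerically. The sketch below counts expected hits per person, first when persons are distributed in proportion to p_{x/y} and then when everyone giving a response is assigned to its most probable category. The probabilities and function names are invented for illustration.

```python
# Expected proportion of hits under two interpretation rules,
# with all hits valued 1 and all errors valued 0. Numbers are hypothetical.

def hits_proportional(p_y, p_x_given_y):
    """w_{x'/y} = p_{x/y}: a person in x is called x' = x with chance p_{x/y}."""
    return sum(
        py * sum(p * p for p in row)
        for py, row in zip(p_y, p_x_given_y)
    )


def hits_modal(p_y, p_x_given_y):
    """Everyone giving response y is assigned to the most probable category."""
    return sum(py * max(row) for py, row in zip(p_y, p_x_given_y))


p_y = [0.5, 0.5]                        # distribution of responses
p_x_given_y = [[0.7, 0.3], [0.4, 0.6]]  # rows: response y; columns: category x

proportional = hits_proportional(p_y, p_x_given_y)
modal = hits_modal(p_y, p_x_given_y)
```

Since the sum of squared probabilities can never exceed the largest probability, the modal rule gives at least as many hits for any response; the two rules tie only when one category is certain.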


We conclude that C, ΔC, and related formulas are not satisfactory
measures of the goodness of a test. This criticism also applies to the
"discriminating power" formulas reviewed in Section III.

Review  of  the  Average  Log  Confidence  Formulas 


We can examine the formulas based on H from the same point of view.
-NH is the logarithm, it will be recalled, of the probability of correctly
classifying an indefinitely large configuration of persons, drawn at random
from the population specified by the p_x. Thus 2^{-NH} may be regarded as a
utility measure making the same assumptions as are involved in C, except
that the evaluation and interpretation matrices now are to apply to the whole
very long sequence. The evaluation matrix looks like the one shown above,
but now is based on a sequence v instead of a single person. Here, then, a
[...]

                 v'
            a'   b'   c'  ...
       a    1    0    0
  v    b    0    1    0
       c    0    0    1
      ...


The interpretation matrix implied is not open to the criticism advanced
in connection with average confidence formulas. In interpreting a single
response or finite set of responses there usually will be a set of w_{x'/y}
differing from the corresponding p_{x/y} which gives greater average
confidence than the w_{x'/y} = p_{x/y}. This is, however, not true for the
infinite sequences of independent messages, because all [...] are equal, and
every [...] is 1 or 0, or equal to [...]. When evaluations are as shown above,
interpreting an indefinitely long sequence of responses y by assigning persons
to the configurations v' so that w_{v'/y} equals the corresponding p_{v/y} is
as good as any other strategy.

It is highly unlikely that the evaluation matrix shown will ever fit
the testing problem. Questions were raised above as to the use of equal
values for all hits and equal values for all errors. Apart from this, the
matrix shown above counts an assignment as of zero value unless every element
in the sequence is correct. That is, in the testing problem, 2^{-NH} measures
utility only if we regard it as of zero value to classify N-1 persons properly
so long as the Nth is wrongly classified.

Since we find that the Shannon H and R measures may be used only
approximately in evaluating sequential testing, may not be interpreted in
terms of coding in our situation, and may not be translated reasonably into
utilities, we are abandoning attempts to derive useful formulas for test
analysis from the Shannon treatment. We do not regard it as sound to follow
Hick, for example, in his proposal (19, p. 162) to treat information measures
formally to determine the best design for a test.

We have, of course, profited from the conceptual leads suggested by the
information analogy, but believe these can best be formulated in utility
terms.


V. ANALYSIS OF A PSYCHIATRIC SCREENING TEST BY UTILITY THEORY


This study of the value of a psychiatric screening test is a demonstration
of some consequences of an approach through utility theory. Utility
theory, as outlined in Section IV, studies the value of a test by taking into
account the probabilities of making correct inferences from the test to the
criterion and the value (benefit or cost) of each correct and incorrect
inference. If we apply this approach to typical test data, we identify
points [...] validity coefficient is reported for a test.


A test yields a score or pattern of responses. This "output" is
translated by an interpretation formula of some sort into a proposed decision.
In order to judge the worth of the test, we must have an indication of the
correct decision for each person in the validation sample. For each possible
pairing of actual decision with correct decision, we also take into account
an evaluation which expresses the seriousness of any error or the gain from
any correct decision. The net value of the test is expressed in terms of the
value of the decisions based on the test, as compared with the decisions that
[...]


It was recognized during World War II that the correlation coefficient
is not well adapted to reporting the worth of a psychiatric screening
test (28). The practice was then introduced of reporting the effectiveness of
a screening test in a more complicated form. For a test used to screen
recruits, the typical report of that period states for each cutting score the
resulting number of hits (correct decisions), misses (deviates reported as
normals) and false positives (normals reported as deviates). This rather
complex report permits any user to evaluate the test for his purposes.


To illustrate a simple utility analysis, we shall examine here the NDRC
screening test, developed to detect probable psychiatric casualties among
Navy recruits. Such a test was needed because of the importance of
identifying men who would probably become psychiatric casualties, and because
the large intake of men overburdened psychiatrists who might otherwise have
made this judgment in an individual interview. The screening test was
introduced as an adjunct to the psychiatric interview in order to reduce the
number to be interviewed carefully. Validity data for the test, comparing the
test score with a psychiatrist's judgment on each man, are reported in the
Summary Technical Report of NDRC (27). (The table of data we employ comes from
a secondary source, and probably contains some very small inaccuracies.)

Table 6 shows the distribution of test scores for the two criterion groups.


One emphatic caution needs to be expressed regarding the present analysis.
We have data only for a dichotomous criterion, and our method of analysis
assumes a dichotomy. But not all those judged "normals" by the psychiatrist
are equally free from troublesome symptoms, and not all "deviates" are equally
undesirable to the service. Our analysis necessarily fails to give the test
credit for any power it has to make practically significant discriminations
within criterion groups. Our present demonstration serves only to show the
general method. It presents the conclusions of an analysis with a discrete
two-point criterion, but this analysis is undoubtedly oversimple for ultimate
evaluation of the screening test.


Table 6. Raw Data Indicating the Validity of the NDRC Screening Test

             Frequency of score among men
             judged by psychiatrists to be
Score        Deviates      Normals       Total
 19             ?             ?             1
 18             ?             ?             ?
 17             ?             ?             1
 16             ?             ?             1
 15             ?             ?             1
 14             ?             ?             3
 13             1             1             2
 12             ?             ?             1
 11             5             1             6
 10             3             5             8
  9             3             4             7
  8             1             8             9
  7             2            18            20
  6             0            24            24
  5             3            41            44
  4             2            55            57
  3             4            80            84
  2             1            96            97
  1             2            97            99
  0             0            22            22
Total          30           458           488

(? marks an entry that is illegible in the source copy.)

Correlational Validity


The conventional method of summarizing validity in a single index would
be to compute point-biserial r. For these data, r_{pbis} = .35[?]. Such an
index is not very satisfactory, however, because the result would be altered
if the scale of measurement of the test were transformed, for instance, into
normalized scores (cf. Brogden, [?]). Such a transformation would not alter
the effectiveness of the test as a screening device, and so the change in
coefficient would be misleading. In contrast to this coefficient, the value
derived by utility analysis does not change when the measuring scale for the
test is transformed.


A second correlational index of some merit would be the phi coefficient,
obtained after dichotomizing the test at some cutting point. The value of
phi varies as the cutting point changes. The only unique value of phi is the
maximum, obtained by cutting so as to divide the test in categories having
the same proportions as the criterion categories. According to Table 6, this
would call deviates all persons with a score of 10 or over, and 5/7 of those
at 9. For this particular cutting score, the fourfold contingency table is
as shown below and phi is .437.


                          Assigned Category (x')
                          Deviate       Normal       Total
True         Deviate      14 1/7        15 6/7          30
Category
(x)          Normal       15 6/7       442 1/7         458

             Total          30           458           488
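The phi coefficient for this fourfold table can be recomputed directly. The sketch below uses the standard formula for phi in a 2 x 2 table and exact fractions for the cell entries; the function name is ours.

```python
# Recomputing phi for the fourfold table at the cut that calls 30 men deviates.
from fractions import Fraction
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi for the table [[a, b], [c, d]] (rows: true; columns: assigned)."""
    numerator = a * d - b * c
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return float(numerator) / denominator


dev_called_dev = Fraction(99, 7)      # 14 1/7
dev_called_norm = Fraction(111, 7)    # 15 6/7
norm_called_dev = Fraction(111, 7)    # 15 6/7
norm_called_norm = Fraction(3095, 7)  # 442 1/7

phi = phi_coefficient(dev_called_dev, dev_called_norm,
                      norm_called_dev, norm_called_norm)
```

The result rounds to .437, in agreement with the value quoted in the text.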

We may point out two features of correlational validity.

1. It is expressed on a scale ranging from -1 to +1, on which zero
represents the correlation that would be expected if persons were divided
into categories by chance.

2. No explicit assumption is made regarding the comparative
seriousness of misses and false positives. The extent to which phi is
reduced by an additional error of either type depends on values in the
fourfold table. For most tables, phi essentially counts each type of error as
equally serious.

Utility Analysis


Introduction of Evaluations


In practice, a cutting score is fixed by considering the relative
seriousness of misses as opposed to false positives (or, in selection
generally, of incorrect "accept" decisions, as opposed to incorrect
rejections). If false positives are very damaging, we set a high cutting
score; if we would rather risk false positives than misses, we set a low
cutting score. The optimal cutting score depends upon the values placed on
these errors.


We prepare an evaluation matrix of this form:

                          Assigned Category (x')
                          Deviate       Normal
True         Deviate      e_{dd}        e_{dn}
Category
(x)          Normal       e_{nd}        e_{nn}

Evaluations may be positive or negative. Here, e_{dd} is the value associated
with a correct decision deviate-called-deviate, e_{dn} is the value associated
with a "miss" (this value is negative, because this decision is costly), and
e_{nd} is the value of a "false positive."

Since we will evaluate the test by the gain in goodness of decisions, we
may add or subtract any constant to both entries in a row with no change in
our end results, so long as the criterion proportions remain fixed. In the
first row, we find it helpful to subtract e_{dn} from both entries; in the
second, we subtract e_{nd}. This yields the equally useful matrix:

                          Assigned Category (x')
                          Deviate             Normal
True         Deviate      e_{dd} - e_{dn}     0
Category
(x)          Normal       0                   e_{nn} - e_{nd}

A useful summary of such an evaluation matrix for a dichotomy is the
evaluation ratio:

    E.R. = \frac{e_{dd} - e_{dn}}{e_{nn} - e_{nd}}

If we can count e_{dd} and e_{nn} as zero, E.R. is the ratio of the cost of
misses to that of false positives.
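The reduction of the evaluation matrix and the evaluation ratio can be illustrated with a short sketch. The four evaluations below are invented; they are chosen only so that the ratio comes out at 1.6, a value used later in this section.

```python
# Evaluation ratio for a dichotomy, and its invariance under the row shift
# described above. The evaluations are hypothetical.

def evaluation_ratio(e_dd, e_dn, e_nd, e_nn):
    """E.R. = (e_dd - e_dn) / (e_nn - e_nd)."""
    return (e_dd - e_dn) / (e_nn - e_nd)


def reduce_matrix(e_dd, e_dn, e_nd, e_nn):
    """Subtract e_dn from the deviate row and e_nd from the normal row."""
    return (e_dd - e_dn, 0.0, 0.0, e_nn - e_nd)


original = (0.0, -8.0, -2.0, 3.0)   # (e_dd, e_dn, e_nd, e_nn), assumed values
reduced = reduce_matrix(*original)

er_original = evaluation_ratio(*original)
er_reduced = evaluation_ratio(*reduced)
```

Both matrices give the same ratio, which is why only the two differences e_dd - e_dn and e_nn - e_nd need to be estimated in practice.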

Determining entries for the evaluation matrix. Values such as e_{nd},
etc., are always taken into account in setting cutting scores even when the
test user does not recognize it, for he must decide on the acceptable ratio
of misses to false positives, and this implies a valuation. The evaluation
ratio is very similar to the risks specified in making decisions on the basis
of an experiment. There one is asked to state what risk he is willing to run
of rejecting a true hypothesis (Type I error) and what risk of accepting a
false hypothesis (Type II error). These risks are derived by the user on the
basis of some notion of the practical cost of the two sorts of error. The
weighting of the risks needs to be brought out explicitly, for otherwise it
cannot be examined and criticized.


Without better data than we now have, values in the evaluation matrix
will always involve some arbitrary judgments. To arrive at e_{dd} - e_{dn}, we
estimate the gains and losses (a) if a deviate is called deviate, and (b) if
he is called normal on the basis of the test. Suppose recruits called
deviates are dropped from service; suppose normals are sent into basic
training. The deviate-called-normal brings certain benefits and costs. We
expect him to perform duties as a sailor which have value. He will continue
until he gets into disciplinary trouble, is picked up on sick call, or
breaks down in combat. Now we must assess:


Gain from work capably performed until discharge

Cost of training

Costs arising from symptoms of his disorder (time required to
investigate disciplinary problems, time of medical staff, cost of his
errors, loss in efficiency of his unit [...] normally, etc.)

Cost of treatment for which the service takes responsibility
(pension for service-incurred disability, etc.)


The fact that these (and other) costs are likely to outweigh the benefits
from his service is the reason for desiring a psychiatric screening device.
In the absence of accounting figures, these values are estimated, just as
they are implicitly estimated in setting cutting scores at present. The sum
of gains-minus-costs for this deviate-called-normal gives us e_{dn}.


Figure 14. Likelihood ratios at various cutting scores (smoothed).
[Vertical axis: likelihood ratio p_{n/y}/p_{d/y}, logarithmic scale;
horizontal axis: score (y).]


For the deviate-called-deviate, we may take e_{dd} as zero. He is rejected,
and there is neither gain nor cost from him. A more precise analysis, as
Brogden shows (5), should take the cost of testing into account in each cell,
but this factor would be a distraction in this introductory report. When
e_{dd} and e_{dn} are fixed, e_{dd} - e_{dn} is determined. As we have set
e_{dd} equal to zero, e_{dd} - e_{dn} = -e_{dn}.

A similar analysis would be made for normals to get e_{nn} - e_{nd}.

Likelihood Ratios as a Basis for Cutting Scores

We compute a likelihood ratio (L.R.) for each test score as a convenient
way to examine Table 6. First we divide the frequency of normals at any
score by the frequency of deviates. If y is a test score,

    L.R._y = p_{ny} / p_{dy} = p_{n/y} / p_{d/y}

At score 10 the likelihood ratio is 1.67. These likelihood ratios are
unreliable, especially where the frequency of deviates is low. We smooth the
curve on the assumption that this will yield more accurate estimates of
population values than the points based on the sample. Points for the smoothed
curve are obtained by pooling adjacent scores, when the frequency per score
category is small. Thus, the ratio at score 10 can be estimated by pooling
data for 9, 10, and 11 given in Table 6; the pooled ratio is
(4 + 5 + 1)/(3 + 3 + 5), or .91.

The logs of the ratios (or ratios plotted on semilogarithmic paper) are then
plotted against score. The logarithms make for better smoothing. The best
fitted smooth curve gives the final estimate of likelihood ratios.
Theoretically, the best curve may have one or more dependable maxima or
minima; if so, we may require cutting scores on both sides of the maximum
(minimum). In this test, where no such maximum exists, a single optimum
cutting score can be determined for each E.R.
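The raw and pooled likelihood ratios around score 10 can be reproduced as follows. The counts for scores 9, 10, and 11 are our reading of Table 6 and should be treated as an assumption where the copy is unclear; the function name is ours. The logarithm is included only because the report plots log ratios before smoothing.

```python
# Raw and pooled likelihood ratios (normals / deviates) near score 10.
from math import log10

# score: (deviates, normals), as read from Table 6 (assumed reading)
table6 = {9: (3, 4), 10: (3, 5), 11: (5, 1)}


def likelihood_ratio(scores, table):
    """Frequency of normals divided by frequency of deviates over `scores`."""
    deviates = sum(table[s][0] for s in scores)
    normals = sum(table[s][1] for s in scores)
    return normals / deviates


raw_lr_10 = likelihood_ratio([10], table6)             # 5/3
pooled_lr_10 = likelihood_ratio([9, 10, 11], table6)   # 10/11
log_pooled = log10(pooled_lr_10)                       # value plotted vs. score
```

Pooling trades resolution for stability: the single-score ratio rests on only three deviates, while the pooled ratio rests on eleven.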


Choice of Cutting Score


Figure 14 shows that, in a further sample, at every score level we may
expect normals to exceed deviates. In Table 6, deviates exceed normals at
some scores (e.g., 11); this is almost certainly due to a sampling
fluctuation. Figure 14 gives a likelihood ratio at each score; thus, according
to the chart, we may expect L.R. to be 1.6 for persons earning score 10 in the
population. If we classify persons at score 10 as normals, we will be right
1.6 times for every error; if we classify them as deviates, we make 1.6 errors
(false positives) for every hit.

The best cutting score we can select is that at which L.R. equals E.R.*
At any score y such that L.R._y > E.R., people should be called normals; and
where L.R._y < E.R., they should be classified as deviates.


We can show this by considering three evaluation ratios: 1:1, 2:1,
1.6:1. If L.R. is 1.6 at 10, our probability of misclassification** and our
utility per decision regarding persons who receive score 10 are as shown in
Table 7.

Table 7. Expected Probability of Errors and Utility per
Decision for Persons Whose Score Is 10

                            p_{xx'}                  Utility when E.R. is
                     Deviates    Normals
                     called      called
                     deviates    normals          1:1      1.6:1      2:1

10's are
called normal                     .615           .615c     .615c     .615c

10's are
called deviates       .385                       .385c     .615c     .770c

Here c is a positively valued constant, equal to e_{nn} - e_{nd}. It is
apparent that if E.R. > 1.6 (i.e., when identifying deviates is relatively
important), we gain most by calling 10's deviates. When identifying deviates
is relatively unimportant compared to avoiding false positives, we gain most
by calling all 10's normal. If E.R. = 1.6, persons above 10 should be called
deviates and persons below 10 called normals; persons at 10 may be classified
either way.
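The entries of Table 7 follow from the likelihood ratio alone, as the sketch below shows. It assumes the reduced evaluation matrix, in which only hits carry value, and expresses utility in units of c; the function names are ours.

```python
# Utility per decision at one score, in units of c, from L.R. and E.R.
# Assumes the reduced matrix: a normal-called-normal is worth c, a
# deviate-called-deviate is worth E.R. * c, and errors are worth zero.

def p_normal(lr):
    """Probability that a person at this score is normal, given L.R."""
    return lr / (1.0 + lr)


def utility_called_normal(lr):
    return p_normal(lr)


def utility_called_deviate(lr, er):
    return (1.0 - p_normal(lr)) * er


def best_label(lr, er):
    """Cutting rule: call normal where L.R. exceeds E.R., deviate otherwise."""
    return "normal" if lr > er else "deviate"


LR_AT_10 = 1.6
row_normal = [utility_called_normal(LR_AT_10) for _ in (1.0, 1.6, 2.0)]
row_deviate = [utility_called_deviate(LR_AT_10, er) for er in (1.0, 1.6, 2.0)]
```

The last entry of `row_deviate` is about .769; the table's .770 comes from doubling the rounded figure .385. The comparison of the two rows at each E.R. reproduces the rule that persons are called normal only where L.R. exceeds E.R.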

According to the figure, the optimal cutting score for these data
shifts very markedly as E.R. shifts from 1:1 to 1.5:1. Over a range of E.R.
from 1.5:1 to 30:1, the optimal cutting point shifts slowly. The curve
changes slope again, and there is no score for which L.R. > 39:1.

The selection of a cutting score may be very sensitive to the E.R. We
shall show later that one pays a substantial price for using a non-optimal
cutting score. Hence, the price of misses and false positives must be
judged carefully.


*A proof for this relation has been developed by Goldine Gleser.
**In the population represented by the smoothed curve.


Optimal strategy at high and low E.R. Figure 14 showed that there is no
optimal cutting score for evaluation ratios 1:1 or lower, nor for 40:1 and
higher. This means that if E.R. is 1:1 or lower, the greatest possible net
gain (unless a better test is used) will result when we call all men normals.
If E.R. is 40:1 or higher, the wisest decision is to call all men deviates.
(This is probably a decision we cannot allow, for we must accept some men
into the service even if it is "a losing proposition.") Our analysis to this
point has considered all strategies allowable. We shall later have to
introduce limitations on the strategies, but for the present we shall continue
with no limits on strategy.


Comparison with Conclusions of Correlational Analysis


It [...] normal, when deviates are known to be present and our test has some
validity. Let us examine this. If every man in this sample is called normal,
we have 30 misses. Suppose instead we call the top 30 men deviates, as in our
earlier table for finding phi. Then we have 15 6/7 misses and 15 6/7 false
positives: the change in number of errors is small, but we do have more errors
than before. A cut at this point would be wiser than passing every man only
if false positives are cheaper, i.e., more desirable than misses. A "valid"
test may have no utility, at least for a discrete criterion. It is therefore
apparent that the conclusion as to whether a test should be used cannot
be based solely on a correlational validity coefficient.

Judgment by utility analysis adds to that offered by correlational
analysis because the latter compares decisions based on the test only with
decisions based on chance. If 30 persons were called deviates by chance, we
would have 28 false positives and 28 misses, and the test did indeed improve
on this. But a test is used because we want to improve on the best decisions
we could make without the test. The best a priori decision (when no data on
individuals are at hand) is whatever decision yields maximum utility with the
expected probabilities in the criterion categories. If we call everyone
normal, our errors cost us less than any chance assignment, until E.R. rises
to 15.3:1 (458/30 = 15.3). At E.R. higher than this, our optimum a priori
solution is to call everyone a deviate.
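The figures in this paragraph can be checked with a short computation. The sketch assumes the scale on which misses and false positives count zero, a normal-called-normal counts one unit, and a deviate-called-deviate counts E.R. units; the function names are ours.

```python
# A priori strategies for 30 deviates and 458 normals.
N_DEVIATES, N_NORMALS = 30, 458
N_TOTAL = N_DEVIATES + N_NORMALS


def chance_counts(n_called_deviate):
    """Expected cell counts when persons are called deviates at random."""
    fraction = n_called_deviate / N_TOTAL
    hits_dd = N_DEVIATES * fraction
    false_positives = N_NORMALS * fraction
    misses = N_DEVIATES - hits_dd
    hits_nn = N_NORMALS - false_positives
    return hits_dd, misses, false_positives, hits_nn


def utility(hits_dd, hits_nn, er):
    """Weighted number of hits, in units of the value of one normal hit."""
    return er * hits_dd + hits_nn


hits_dd, misses, false_positives, hits_nn = chance_counts(30)
break_even_er = N_NORMALS / N_DEVIATES

u_all_normal = utility(0, N_NORMALS, break_even_er)
u_all_deviate = utility(N_DEVIATES, 0, break_even_er)
u_chance = utility(hits_dd, hits_nn, break_even_er)
```

Random assignment of 30 men yields about 28 misses and 28 false positives, and at an E.R. of about 15.3 the three blind strategies give the same utility, which is the break-even point named in the text.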


For tests having a finite range, there will almost always be a maximum
and minimum E.R. beyond which we should not use the test in selection. This
principle has not been pointed out in earlier studies because models for
studying test validity have [...] have unlimited range. Further study is
required to determine what may occur when the criterion range expressed in
utility units is essentially unlimited but the test distribution has finite
limits. It appears that when both test and criterion are unlimited in range,
a test having validity above chance will always have some utility (though
perhaps not greater than the cost of testing).


Utility Curves for Various Strategies

Decisions with the Test Compared to a priori Decisions

In order to demonstrate the relation of strategy to utility, we shall
study two strategies for interpreting the test results, along with the three
conceivable a priori strategies.


Strategies to be compared. The a priori strategies are as follows:

a. A priori chance. Thirty persons at random called deviates.

b. All persons called deviates. This is an optimum a priori
strategy for all high E.R.

c. All persons called normals, the optimum a priori strategy
for all low E.R.

The a posteriori strategies are:

d. Thirty highest scoring persons called deviates (best cutting
score for E.R. 1.6:1).

e. Persons above 4 called deviates (best for E.R. 30:1).

In obtaining utility curves, we must plot evaluations rather than
evaluation ratios. Because many sets of $e_{dd}$, etc., yield the same E.R., we
need to select certain values. In the absence of actual utility data, it is
most convenient to employ a scale with the following definition:

$$e_{dn} = e_{nd} = 0, \qquad e_{dd} = \mathrm{E.R.} \cdot e_{nn}$$

The utility for any strategy is then the weighted number of hits it
allows.

$$V = e_{dd}\, n_{dd} + e_{nn}\, n_{nn}$$

Here $n_{dd}$ is the number of deviates called deviates, etc. We shall plot
$V/e_{nn}$, recognizing that we have no fixed unit of measurement. The units on
the scale are not to be regarded as equal from one E.R. to another.
$V/e_{nn}$ allows us to compare strategies only within any E.R.
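As a minimal sketch of this scale, the three a priori strategies can be evaluated directly, since their expected hits follow from the sample composition alone (30 deviates, 458 normals). The function names are illustrative assumptions; only the formula for V divided by e_nn and the sample counts come from the text.

```python
import math

N_DEVIATES, N_NORMALS = 30, 458          # sample composition from the text
N_TOTAL = N_DEVIATES + N_NORMALS         # 488 recruits

def scaled_utility(er, n_dd, n_nn):
    """V / e_nn = E.R. * n_dd + n_nn (misses and false positives count zero)."""
    return er * n_dd + n_nn

def a_priori_hits(n_called_deviate):
    """Expected (n_dd, n_nn) when that many persons are called deviates at random."""
    share = n_called_deviate / N_TOTAL
    return N_DEVIATES * share, N_NORMALS * (1 - share)

def a_priori_utilities(er):
    """Scaled utilities of strategies a, b and c at a given E.R."""
    return {
        "a": scaled_utility(er, *a_priori_hits(30)),       # thirty at random
        "b": scaled_utility(er, *a_priori_hits(N_TOTAL)),  # all called deviates
        "c": scaled_utility(er, *a_priori_hits(0)),        # all called normals
    }

for er in (1.6, N_NORMALS / N_DEVIATES, 30):
    print(round(er, 1), {k: round(v, 1) for k, v in a_priori_utilities(er).items()})
```

At E.R. = 458/30 the three values coincide; below it strategy c is highest and above it strategy b is highest.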


Results. Figure 15 presents utility functions for the strategies [...].
The line labeled max shows the value V would take if the test were able to
make perfect classifications. In Figure 16, we plot the same data, giving a
different set of cross-sections of the three-dimensional surface relating V,
E.R., and strategy. Here E.R. is fixed, to show V as a function of strategy.
The same data are shown three-dimensionally in Figure 17. The interpretations
to be derived from Figures 15-17 are as follows:


a.  Among the a priori strategies either b or c has greater
utility than a, for every E.R. except 15.3:1. At this point,
all classification schemes not employing test data are equally
good.

b.  As E.R. becomes very small, c crosses above every other
strategy; as E.R. becomes very large, b crosses above every
other strategy. Thus at extreme E.R. the test gives us no
gain in utility.


Figure 15. Utilities of various strategies as a function of evaluation ratio


c.  As E.R. becomes smaller, a value of E.R. is reached where
decisions based on a particular cutting score give less utility
than the a priori chance classification. For some strategies
(including d) this occurs only for negative E.R.

d.  Each cutting score gives greater utility than alternative
strategies in that section of the chart where E.R. is close
to the L.R. corresponding to that score.
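Interpretation c can be checked with a short sketch that solves for the E.R. at which a cutting score and the chance strategy a have equal scaled utility. It assumes the scale with misses and false positives valued at zero; the function name and both sets of counts are hypothetical and are not taken from the report.

```python
N_DEVIATES, N_NORMALS = 30, 458    # sample composition from the text
N_TOTAL = N_DEVIATES + N_NORMALS

def er_equal_to_chance(true_pos, false_pos, n_random=30):
    """E.R. at which a cutting score has the same scaled utility (V / e_nn)
    as calling n_random persons deviates at random (strategy a).

    Below the returned E.R. the cutting score is worse than chance, provided
    it finds more deviates than chance does.
    """
    chance_tp = n_random * N_DEVIATES / N_TOTAL   # expected hits on deviates
    chance_fp = n_random - chance_tp              # expected false positives
    if true_pos <= chance_tp:
        raise ValueError("sketch assumes the cutting score beats chance on deviates")
    # Solve er * true_pos - false_pos == er * chance_tp - chance_fp for er.
    return (false_pos - chance_fp) / (true_pos - chance_tp)

# Hypothetical counts, not taken from the report:
strict = er_equal_to_chance(true_pos=12, false_pos=18)    # 30 persons flagged
lenient = er_equal_to_chance(true_pos=24, false_pos=96)   # 120 persons flagged
print(round(strict, 3), round(lenient, 3))
```

When the cutting score flags the same number of persons as strategy a, the crossover falls at E.R. = -1, in line with the remark that for strategy d it occurs only at negative E.R.; a more lenient cut can fall below chance at a positive E.R.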




Figure 18 reorganizes some of the same information. Consider how much
gain over the best a priori strategy the test provides, and divide by the
possible gain. This gives a relative utility (RV) for each strategy.

$$RV = \frac{V - V_0}{V_{\max} - V_0}$$

$V_0$ is the maximum utility obtainable before testing (strategy b or c);
$V_{\max}$, the utility of perfect classification, is $30\,\mathrm{E.R.} + 458$
in units of $e_{nn}$. RV is comparable to the concept of exhaustiveness in
Sections II and III.
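A minimal sketch of the relative utility computation follows. It assumes utilities expressed in units of e_nn with misses and false positives valued at zero, so that the best a priori utility is the larger of 30 E.R. and 458. The function name and the counts of 12 true positives and 18 false positives are hypothetical.

```python
N_DEVIATES, N_NORMALS = 30, 458   # sample composition from the text

def relative_utility(er, true_pos, false_pos):
    """RV = (V - V0) / (Vmax - V0), all utilities in units of e_nn."""
    v = er * true_pos + (N_NORMALS - false_pos)   # utility using the cutting score
    v0 = max(er * N_DEVIATES, N_NORMALS)          # better of strategies b and c
    v_max = er * N_DEVIATES + N_NORMALS           # perfect classification
    return (v - v0) / (v_max - v0)

# Hypothetical cutting score: 12 true positives, 18 false positives.
for er in (1, 5, 15, 30):
    print(er, round(relative_utility(er, 12, 18), 3))
```

A negative RV means the cutting score does worse than the best a priori strategy at that E.R.; a perfect test gives RV = 1.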


[Figure caption, partly illegible: ... of E.R. and cutting score.]

Figure 18. Improvement relative to possible improvement for three strategies


Figure 18 plots relative utilities for strategies d and e. It is
apparent that at extreme E.R. the screening test does not improve the net
worth of decisions, and that a given cutting score will have quite different
exhaustiveness depending on the E.R. Each strategy has its greatest relative
utility when E.R. corresponds to the proportion of normals to deviates in the
population.
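The location of the peak can be checked numerically. The sketch below scans a grid of E.R. values for two hypothetical cutting scores (the counts are invented for illustration) and reports where RV is largest, again assuming utilities in units of e_nn with misses and false positives valued at zero.

```python
N_DEVIATES, N_NORMALS = 30, 458
BREAK_EVEN = N_NORMALS / N_DEVIATES   # about 15.3, normals per deviate

def relative_utility(er, true_pos, false_pos):
    """RV = (V - V0) / (Vmax - V0), all utilities in units of e_nn."""
    v = er * true_pos + (N_NORMALS - false_pos)
    v0 = max(er * N_DEVIATES, N_NORMALS)
    v_max = er * N_DEVIATES + N_NORMALS
    return (v - v0) / (v_max - v0)

def er_of_peak_rv(true_pos, false_pos):
    """Return the grid value of E.R. at which RV is largest."""
    grid = [k / 2 for k in range(1, 121)] + [BREAK_EVEN]   # 0.5 ... 60 plus break-even
    return max(grid, key=lambda er: relative_utility(er, true_pos, false_pos))

# Hypothetical cutting scores: a strict one and a lenient one.
for tp, fp in ((12, 18), (24, 96)):
    print(tp, fp, round(er_of_peak_rv(tp, fp), 2))
```

Under these assumptions both cutting scores peak at the same E.R., because the best a priori utility has its kink where strategies b and c are equally good.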


Misses are usually far more expensive in psychiatric screening than
false positives, at least so long as we are dealing with unselected recruits
rather than specialists. In view of the trouble a misfit or breakdown can
cause, E.R. may well be 20:1 or 40:1. Our judgment of the value of the test
will be much less favorable if E.R. is very high than if E.R. is moderate.
Decisions about E.R. are of major importance in test evaluation. It appears
that investigators are unwise to dismiss this problem casually by assuming
that misses are equal in seriousness to false positives (cf. 2).


Limited  Strategies 

Instead of having the unlimited strategies so far considered, a tester
may be restricted by practical circumstances. He would often be required to
accept some minimum number of men, and then strategy b, and possibly also e,
would be disallowed. Our formulation can handle this problem, and the
results are informative.


Suppose that at least 400 men out of 488 must be passed. Then c is the
best a priori strategy for E.R. < 15.3. The best a priori strategy if
E.R. > 15.3 is to reject just as many men as we are allowed to reject. The
a priori utility so attained will be less than when the tester is allowed to
reject everyone; as a consequence the test used with any particular cutting
score can contribute more at any high E.R. than it did when strategies were
unlimited. On the other hand, cutting scores below 6, which are the most
promising for high E.R., are not now allowed.
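The effect of the quota on the a priori baseline can be sketched as follows. The code assumes that men rejected without test data are in effect chosen at random, and that utilities are in units of e_nn with misses and false positives valued at zero; the function names are illustrative.

```python
N_DEVIATES, N_NORMALS = 30, 458
N_TOTAL = N_DEVIATES + N_NORMALS   # 488 recruits

def a_priori_utility(er, n_rejected):
    """Expected V / e_nn when n_rejected men are rejected at random."""
    share = n_rejected / N_TOTAL
    return er * N_DEVIATES * share + N_NORMALS * (1 - share)

def best_a_priori(er, max_rejected):
    """Return (n_rejected, utility) for the best no-test strategy under a quota."""
    # Utility is linear in n_rejected, so only the two endpoints need checking.
    return max(((k, a_priori_utility(er, k)) for k in (0, max_rejected)),
               key=lambda pair: pair[1])

max_rejected = N_TOTAL - 400                 # at least 400 of 488 must be passed
print(best_a_priori(10, max_rejected))       # E.R. below 15.3: reject no one
print(best_a_priori(40, max_rejected))       # E.R. above 15.3: reject all 88 allowed
print(best_a_priori(40, N_TOTAL))            # unlimited: reject everyone
```

At a high E.R. the quota lowers the best utility attainable without the test, which leaves more room for a cutting score to contribute.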

When we are required to reject 6.15% (30/488) of the recruits, neither
more nor less, we may use only strategy a or d. Then and only then does this
test permit improvement on the allowable a priori decisions, for all E.R.
greater than 1.


Reduction  of  Costs 


An approach is available which makes tests useful even where they might
otherwise not contribute to utility. Suppose, using the test for screening,
we estimate E.R. to be 40:1. Then the test is not more advantageous than
rejecting all men, supposing that this is allowed. If we can reduce E.R.,
however, the test may nonetheless be useful. This ratio,

$$\mathrm{E.R.} = \frac{e_{dd} - e_{dn}}{e_{nn} - e_{nd}},$$

can be reduced in these ways:

    by raising $e_{nn}$
    by raising $e_{dn}$ (decreasing the cost of misses)
    by lowering $e_{dd}$
    by lowering $e_{nd}$ (increasing the cost of false positives)

— — • — / 


If we put the men called normal by the test through a further screening
procedure having some validity, we divide the apparent normals into
accepts-still-called-accept, and accepts-now-called-reject. This detects some
deviates who would originally have been missed. Thus we raise $e_{dn}$, the
average value of the group of deviates called normal by the original test.
The second screen lowers $e_{nn}$ and $e_{dn}$ by the cost of the screening
procedure, which must therefore not be too great. No strategic change to
raise $e_{nn}$, to lower $e_{dd}$, or to lower $e_{nd}$ suggests itself.
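The four ways of reducing the ratio can be illustrated with a small sketch. The evaluations below are hypothetical numbers chosen only so that E.R. starts at 40:1; they are not cost data from the report.

```python
def evaluation_ratio(e_dd, e_dn, e_nn, e_nd):
    """E.R. = (e_dd - e_dn) / (e_nn - e_nd)."""
    return (e_dd - e_dn) / (e_nn - e_nd)

# Hypothetical evaluations giving E.R. = 40:1.
base = dict(e_dd=40.0, e_dn=0.0, e_nn=1.0, e_nd=0.0)

# Each entry changes exactly one evaluation in the direction named in the text.
changes = {
    "raise e_nn": dict(base, e_nn=2.0),
    "raise e_dn (cheaper misses)": dict(base, e_dn=20.0),
    "lower e_dd": dict(base, e_dd=30.0),
    "lower e_nd (costlier false positives)": dict(base, e_nd=-1.0),
}

print("base", evaluation_ratio(**base))
for label, values in changes.items():
    print(label, evaluation_ratio(**values))
```

A second screen applied to the apparent normals corresponds to the second entry: it raises the average value of the deviates called normal and so lowers E.R.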


The relative utility curves (Figure 18) show that there exists some E.R.
for which RV is maximal. Theoretically, at least, for any testing situation
it is possible to adjust E.R. to the value where the test is maximally
effective, as judged by RV. It is ordinarily more reasonable, however, to
devise a test to fit the situation.


The foregoing paragraph suggests that when the cost of a psychiatric
interview for everyone is prohibitive, and when E.R. is high, it is advisable
to use the psychiatrists' limited time to rescreen the men accepted, not the
men judged to be deviates by the test. A test might be of no utility when
employed to make a final accept-reject decision, if the cost of misses is
very high. But by employing a second screen to reduce E.R. by cutting the
cost of misses, we move into a region of Fig. 15 where the test can make a
positive contribution.


This is consistent with the general principle that a fallible test may be
profitably used in a tentative decision which can be reversed on the basis
of further evidence, even though the test is too undependable to use in final
decisions.




Conclusions 


This paper has demonstrated how utility analysis can investigate
whether the NDRC screening test contributes to the goodness of decisions in
psychiatric screening of recruits. We show that:

a.  The optimal cutting score can be rigorously determined, if
data on costs of various decisions are obtained.

b.  A test which allows better-than-chance decisions may not permit
improvement over the best a priori strategy unless allowable
strategies are markedly restricted.

c.  For any particular cutting score, there are some evaluation
ratios at which the test makes no contribution.

d.  A test may have value for preliminary screening even though it
does not have utility if employed as a final screen.

These conclusions are based on a treatment of the criterion as a dichotomy.

Our purpose has been to illustrate how utility theory makes explicit
many assumptions and relations not readily recognized heretofore. We have
seen that taking evaluations into account permits sounder analysis than
correlational treatment, or an assumption that false positives and misses
have equal value. Furthermore, in judging the usefulness of a test, one must
consider how it is to be employed.

The emphasis on a "dollar criterion" makes clear that studies of test
validity must in the future give considerable attention to cost data, in
order to determine evaluations as exactly as possible. Seat-of-the-pants
estimates of evaluation ratios -- or even worse, blind assumptions that all
errors have equal cost -- can lead to quite costly errors of judgment in
deciding whether to use a test, whether to use it for preliminary or final
screening, and what cutting score to employ.




References

1.  Angell, G. W. The effect of immediate knowledge of quiz results on
    final examination scores in freshman chemistry. J. educ. Res., 1949,
    42, 391-394.

2.  Barry, J. R., and Raynor, G. H. Psychiatric screening of flying
    personnel: research on the Cornell Index. USAF School of Aviation
    Medicine, Project No. 21-0202-0007, Report No. 2, Randolph Field,
    Texas, 1953.

3.  Bechtoldt, H. P. Selection. In S. S. Stevens (ed.), Handbook of
    experimental psychology. New York: Wiley, 1951. Pp. 1237-1266.

4.  Brogden, H. E. On the interpretation of the correlation coefficient as
    a measure of predictive efficiency. J. educ. Psychol., 1946, 37, 65-76.

5.  Brogden, H. E. When testing pays off. Personnel Psychol., 1949, 2,
    171-183.

6.  Brogden, H. E., and Taylor, E. K. The dollar criterion--applying the
    cost accounting concept to criterion construction. Personnel Psychol.,
    1950, 3, 133-154.

7.  Coombs, C. H. Mathematical models in psychological scaling. J. Amer.
    Statist. Assn., 1951, 46, 480-489.

8.  Coombs, C. H. A theory of psychological scaling. Ann Arbor, Mich.:
    Univer. of Mich. Press, 1952. Engineering Research Bulletin No. 34.

9.  Coombs, C. H. On the use of objective examinations. Educ. Psychol.
    Measmt., 1953, 13 (2), 308-310.

10. Conrad, H. Information which should be provided by test publishers and
    testing agencies on the validity and use of their tests. Proceedings,
    1949 invitational conference on testing problems. Princeton:
    Educational Testing Service, 1950. Pp. 63-68.

11. Cronbach, L. J. A generalized psychometric theory based on information
    measure. Urbana, Ill.: Bureau of Research and Service, College of
    Education, University of Illinois, 1952. (Mimeographed)

12. Cureton, E. E. Validity. In E. F. Lindquist (ed.), Educational
    measurement. Washington, D. C.: American Council on Education, 1951.
    Pp. 621-694.

13. Dailey, C. A. The practical utility of the clinical report.
    J. consult. Psychol., 1953, 17, 297-302.

14. Evans, R. N. A suggested use of sequential analysis in performance
    acceptance testing. Urbana, Ill.: Univer. of Ill., 1953. Technical
    report under ONR contract N6ori-07142. (Mimeographed)

15. Ferguson, G. A. On the theory of test discrimination. Psychometrika,
    1949, 14, 61-68.

16. Fisher, R. A. Theory of statistical estimation. Proc. Cambridge
    Philos. Soc., 1925, 22, 700-725.

17. Gage, N. L. Judging interests from expressive behavior. Psychol.
    Monogr., 1952, 66, No. 18 (Whole no. 350).

18. Garner, W. R., and Hake, H. W. The amount of information in absolute
    judgments. Psychol. Rev., 1951, 58, 446-459.

19. Hick, W. E. Information theory and intelligence tests. Brit. J. Psychol.
    (Statist. Section), 1951, 4, 157-164.

20. Kelly, E. L., and Fiske, D. W. The prediction of performance in
    clinical psychology. Ann Arbor, Mich.: Univer. of Mich. Press, 1951.

21. McGill, W. J. Multivariate transmission of information and its relation
    to analysis of variance. Cambridge: Mass. Inst. of Tech., 1953. Human
    Factors Operations Research Laboratories Report No. 32, Research
    Laboratory of Electronics.

22. Miller, G. A. What is information measurement? Amer. Psychologist,
    1953, 8, 3-11.

23. Pollack, I. The assimilation of sequentially-encoded information.
    I: Methodology and an illustrative experiment. Washington: Bolling Air
    Force Base, 1952. Human Resources Research Laboratories Memo
    Report No. 25.

24. Quastler, H. (ed.) Information theory in biology. Urbana, Ill.:
    Univer. of Ill. Press, 1953.

25. Shannon, C. E. Communication in the presence of noise. Proc. Inst.
    Radio Eng., 1949, 37, 10-21.

26. Shannon, C. E., and Weaver, W. The mathematical theory of
    communication. Urbana, Ill.: Univer. of Ill. Press, 1949.

27. Shipley, W. C., and Graham, C. H. Final report in summary of research
    on the personal inventory and other tests. Applied Psychology Panel,
    Project N-113, Report No. 10. OSRD Rep. No. 3963; Publ. Bd. No. 12060.
    Washington: U. S. Dept. Commerce, 1946.

28. Stalnaker, J. M. Personnel placement in the armed forces. J. appl.
    Psychol., 1945, 29, 338-345.

29. Thurlow, W. R. Direct measures of discriminations among individuals
    performed by psychological tests. J. Psychol., 1950, 29, 281-314.

30. Von Neumann, J., and Morgenstern, O. Theory of games and economic
    behavior. Princeton: Princeton Univer. Press, 1947.

31. Wald, A. Statistical decision functions. New York: Wiley, 1950.


